Preface



This volume of the LNCS series contains the papers accepted for presentation at the
Third IFIP International Working Conference on Active Networks (IWAN 2001). The
workshop was held at the Sheraton University City Hotel in Philadelphia, USA, and
was hosted by the University of Pennsylvania.

Active networks aim to ease the introduction of network services by adding dynamic 
programmability to network devices such as routers, and making aspects of the 
programmability accessible to users. Active networks research has focused on the 
development and testing of active techniques that enable dynamic programmability
in a networked environment. These techniques have a wide variety of applications. 
At IWAN 2001 we aimed to bring together members of the various communities 
using active and related techniques, and provide a forum for discussion and 
collaboration, involving researchers, developers, and potential users. Papers 
presented at IWAN 2001 covered the application of active techniques to many 
aspects of network-based communication, including active multicast, active QoS,
active security, active GRIDs, and active management. In addition, there were papers
on architectures, language, and API issues. Although there were only 22
submissions, the standard of the 10 accepted papers was very high. This clearly
indicated a substantial amount of ongoing high-quality research in active networking,
despite the current unfavorable economic conditions in the telecommunications
industry. The papers also demonstrated that the research is genuinely global and
justifies an international workshop of this type.

We would like to thank all the authors who submitted their work. Without them the 
workshop could not exist. The workshop is also indebted to the members of the 
program committee and reviewers who devoted their time and expertise to ensuring 
the high quality of the program, and who deserve our warm thanks and appreciation. 
Above all we would like to thank the organizing committee for their tireless efforts 
providing the administrative and organizational support that is essential for a 
successful conference.

IWAN 2001

Message from the Conference General Chair 



It is an honor to host the Third Working Conference on Active Networks in 
Philadelphia. At this third IWAN, we have clearly demonstrated the broad 
international interest in this exciting area of research and technology, as IWAN will 
now have been held in Europe, Asia, and North America. It is fitting that we are sited 
in Philadelphia, as it has been the source of many revolutionary activities, such as the 
American independence two centuries ago, and the ENIAC fifty years ago (there is 
an ENIAC "mini-museum" in the University of Pennsylvania's Moore School at 33rd 
and Walnut Streets - please stop by if you have time). One can only hope that our 
revolutionary approach to networking will have similar impacts!

This is the proper place to acknowledge the outstanding work of our Program 
Committee, Chaired by Ian Marshall, Scott Nettles, and Naoki Wakamiya. They 
have produced a rigorously reviewed set of full papers for this conference, which, I
think, sets a high bar for future IWANs. I hope you enjoy IWAN 2001!

Jonathan M. Smith

Abstract. Distributed Denial of Service (DDoS) attacks are a pressing problem
on the Internet as demonstrated by recent attacks on major e-commerce servers 
and ISPs. Since their threat lies in the inherent weaknesses of TCP/IP, an
effective solution to DDoS attacks must be formulated in conjunction with a 
new networking paradigm, such as Active Networks. In this paper, we 
introduce a conceptual framework called Aegis, which we propose as a defense 
mechanism against DDoS attacks. The core-enabling technology of this 
framework is the Active Network, which incorporates programmability into 
intermediate network nodes and allows end-users to customize the way network 
nodes handle data traffic. By introducing Aegis, we also wish to demonstrate 
some of the new possibilities that Active Networks can offer.



1 Introduction 

Since February 2000, when a number of major commercial web sites such as Yahoo, 
CNN.com, E*TRADE, eBay, Buy.com and ZDNet were attacked and rendered 
useless for a period of time by DDoS (Distributed Denial of Service) attacks, the word 
'DDoS' has become part of the active vocabulary of most Internet users. The concept 
of DDoS attacks had been known for some time by networking experts, who had 
warned of their implied danger long before these incidents occurred. Until last year, 
however, most companies had treated DDoS as merely theoretical, and were willing 
to wait until something happened before taking any action. Today, DDoS is 
undoubtedly a pressing problem on the Internet, and its potential impact has been well 
demonstrated. Since its threat lies in the inherent weaknesses of TCP/IP, an
effective solution must be formulated in conjunction with a new networking 
paradigm, such as Active Networks. 

In this paper, we will introduce a conceptual framework called Aegis, which we 
propose as a defense mechanism against DDoS attacks. This framework has been 
designed on top of the Active Networks technologies, which incorporate 
programmability into intermediate network nodes (routers or switches) and allow end- 
users to customize the way network nodes handle data traffic. 

The remainder of this paper is organized as follows: Section 2 introduces some 
different types of DDoS attacks and existing countermeasures; Section 3 lists a 
number of our system requirements for the underlying platform that we intend to
exploit in defending against DDoS attacks; Section 4 explains Aegis, our proposed
framework, in detail; Section 5 outlines our future work and identifies a number of 
known issues concerning Aegis; and Section 6 concludes this paper. 



2 DDoS Attacks 

2.1 Types of Attacks 

A Distributed Denial of Service (DDoS) attack is, as its name implies, a distributed
form of Denial of Service (DoS) attack. A DoS attack is characterized by the 
deliberate act of sending a flood of malicious traffic to a server, thus depriving it of its 
resources so that it becomes unavailable to other legitimate users. While there are 
many different tools available to launch DoS attacks, the attacks themselves can be 
divided into two basic types. 

The first type of attack aims to crash the server OS or starve the server's system 
resources, such as its CPU utilization, file storage or memory, by exploiting flaws in 
the server software, security policy or TCP/IP implementation. Common examples of 
this type of attack include SYN Flood, IP Fragmentation Overlap, Windows NT 
Spool Leak and Buffer Overflow [1]. Prevention mechanisms against this type of 
attack, such as SYN cookie, have been well developed by various software vendors, 
and so system administrators can defend their sites by upgrading their server software 
or firewalls. 

The second type of DoS attack is rather more primitive, but also far more 
insidious. This time the attackers do not care what software the victim is using. 
Instead, they simply try to consume all available network bandwidth of the target 
network by bombarding it with massive amounts of traffic. Well-known examples 
include Smurf, Fraggle, ICMP Flood and UDP Flood [1]. Most DDoS attack tools, 
such as TFN, TFN2K, and Stacheldraht, intensify these bandwidth-consuming DoS 
attacks by launching them from multiple sources [2]. We have been particularly 
interested in this second type of attack because to date no existing technologies have 
been able to effectively tackle this problem. Aegis, the system we are proposing, 
focuses on defending against this type of attack. 



2.2 Existing Solutions and their Weaknesses 

A complete countermeasure against bandwidth-consuming DDoS attacks involves 
five stages - prevention, detection, first response, traceback and second response. 
Most current work related to DDoS attacks attempts to address the problems 
occurring in these individual stages. 

In the prevention stage, we want to stop DDoS attacks from being launched in the 
first place. This can be achieved by two means. One is to have all Internet users look 
for the presence of DDoS daemons in their own machines by using some scanning 
tools. Although this method prevents DDoS attackers from exploiting unprotected hosts for
malicious purposes, getting such large-scale cooperation from all Internet users is a
huge obstacle. Another prevention method is to use Cisco's Ingress Filter [3],
which is designed to be installed and configured in every edge router to verify the
legitimacy of each outgoing packet's source address. Packets with IP prefixes not 
matching the edge router's network will not be allowed to go out. This technology 
prevents DDoS attackers from using forged source addresses. Ingress Filters have 
two major weaknesses: 1) at present not all edge routers are configured to enable this 
function. If the attacker's edge router does not have the Ingress Filter enabled, once 
packets with forged source addresses pass the edge router successfully, they are 
almost impossible to catch; 2) Ingress Filters do nothing to prevent non-spoofed
floods. Unlike attacks that aim to consume server resources, bandwidth-consuming
attacks are destructive even without spoofed source addresses.
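
The source-address check that an ingress filter performs can be sketched as follows. This is a minimal illustration in Python, not Cisco's implementation; the function name and the example prefix are hypothetical. It also makes the second weakness visible: a flood packet that carries a genuine source address passes the check.

```python
import ipaddress
from typing import Iterable


def passes_ingress_filter(src_ip: str, local_prefixes: Iterable[str]) -> bool:
    """Return True if the outgoing packet's source address belongs to one
    of the prefixes served by this edge router (hypothetical helper)."""
    src = ipaddress.ip_address(src_ip)
    return any(src in ipaddress.ip_network(prefix) for prefix in local_prefixes)


# Hypothetical edge network served by the router.
prefixes = ["10.50.0.0/16"]

# A spoofed source address from outside the edge network is rejected.
spoofed_allowed = passes_ingress_filter("1.2.3.4", prefixes)

# A genuine local address is accepted, even if the packet is part of a flood.
genuine_allowed = passes_ingress_filter("10.50.3.7", prefixes)
```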

Because current technologies for preventing DDoS attacks are limited,
detection mechanisms against DDoS floods play an important role. In the detection
stage, we want to distinguish 1) a DDoS flood from a high peak of regular traffic, and
2) malicious packets from legitimate packets. This may be achieved by identifying a
number of anomalies such as sudden bursts of traffic, oversized ICMP and UDP
packets, or large numbers of packets with identical payloads. A number of NIDSs
(Network Intrusion Detection Systems) have been developed to monitor abnormal
traffic. One major weakness of most NIDSs is that they can cause "collateral
damage": legitimate packets may be mistakenly treated as malicious
packets.

Upon the detection of a DDoS attack, the next thing we should do is to bring our 
server back to the Internet to allow access by legitimate users. We call this the "first 
response" stage. Currently, the only option at this stage is to contact the ISP and ask 
them to filter out the malicious traffic before it clogs our bandwidth. This solution has 
two drawbacks: 1) the process of getting help from an ISP can take hours or even 
days, allowing the attack to continue [4]; 2) in theory, the attacker can further 
consume all the ISP's bandwidth by increasing the size of flood. 

After we have applied the first-aid patch, we want to find out where the attack is 
coming from. If the source addresses of the attack traffic are spoofed, we can attempt 
to trace them back to their origins by examining the logs in each router hop-by-hop. The
weaknesses of this approach are: 1) it is a slow manual process; 2) this process can
be easily thwarted if one or more routers along the path do not have the facility for 
identifying the upstream source. Recently, more promising traceback technologies 
using some packet-marking mechanisms have been proposed by [5] and [6]. 

During the last stage, the "second response" stage, there is very little that current 
technology can offer in the way of help. The hosts that are generating the flood are 
unlikely to be the attacker's own hosts, but hosts that have been hacked and exploited. 
Therefore, even if we obtain the IP address of each host participating in the attack, we 
are left with only two options: 1) to contact the owner of each host directly and 
inform him that his computer might have been hacked and is participating in a DDoS
attack; or 2) to contact the ISP of each host and ask them to block the flood at the 
uppermost stream possible. Since most DDoS attacks involve hundreds or thousands 
of hosts, neither option is practical. 

The Aegis system is designed as an integrated solution to address issues
pertaining to four stages: detection, first response, traceback and second response
(Fig. 1). The Active Networks paradigm offers us a new degree of freedom to
respond to malicious traffic automatically and effectively.

[Figure: the five stages of a countermeasure (Prevention, Detection, 1st Response, Traceback, 2nd Response) with the current solutions listed for each stage: host scanning (Prevention); calling the ISP (1st Response); hop-by-hop logging and packet marking (Traceback); contacting zombies or their ISPs (2nd Response). The AEGIS label marks the stages covered by Aegis.]

Fig. 1. Scope of Aegis.



3 Active Networks and Our System Requirements 

We have chosen the Active Networks paradigm as the enabling technology for the 
Aegis system because of its potential and flexibility. Currently there are three major 
test beds available: ABone in the US [7], FAIN in Europe [8] and JGN in Japan [9]. 
Various EEs (Execution Environments) have been proposed and are often categorized 
into three groups [10]: Active Packets (such as Smart Packets [11] and Active IP 
Option [12]); Active Nodes (such as ANTS [13] and DAN [14]); and Hybrids (such 
as SwitchWare [15] and Netscript [16]). Although Aegis is not designed for any 
specific EE, our system requirements for the underlying routers include the following: 

1. New functions or network services implemented in standardized modular form 
can be loaded into routers and executed at runtime. We call these modules active 
code. Active code can act autonomously and move among routers as mobile 
agents do among end systems. 

2. Each piece of active code must have a specific owner and must be authorized to 
have full control over all packets associated with its owner. Let “I” denote the 
entire name space of global IP addresses and (s,d) denote a packet with source 
address s and destination address d. Every piece of active code residing in an 
active node belongs to a specific user, who owns a set of IP addresses, denoted as 
“O”. Each piece of active code has access rights to packets 
$\{(s,d) \in [(O \times I) \cup (I \times O)] \mid s \neq d\}$ that are received by the active node (i.e. IP packets
in which the owner's IP addresses appear in the destination or source field).
Library functions such as capturePacketBySrc(), capturePacketByDest(), 
capturePacketAssociatedWith() must be provided by the EE. Active code may 
then monitor, discard or modify these packets. This requirement has been 
proposed in [18]. 

3. Incoming packets with specified IP addresses can trigger active code. No 
proprietary headers, such as one proposed as ANEP [18], are required to 
distinguish the so-called “Active Traffic” from the “Passive Traffic”. This 
requirement is crucial since most attackers would not send packets with these 
proprietary headers. In accordance with this requirement, a routing table in an
active node would look like the one illustrated in Figure 2. An incoming packet
is either dispatched to a specific piece of active code residing in the same node, or routed
to a neighboring node. In this routing table, entries associated with active code
would always precede those for regular traffic, and would be applied only to
incoming packets (not outgoing ones, so that the processed packets can be
forwarded to the next hop without causing loops).



Destination | Source      | Forward to
1.2.3.4     | Any         | Active Code A   (dispatched to the EE, incoming packets only)
10.50.0.0   | 11.12.13.14 | Active Code B   (dispatched to the EE, incoming packets only)
...         | ...         | ...
1.2.0.0     |             | 29.15.20.1      (forwarded to neighboring nodes, regular routing)

Fig. 2. Routing Table for Active Nodes.

4. Any given active node must have the knowledge of all neighboring active nodes. 
Note that a neighboring active node may not be the next-hop router, but it may be 
two or three routers away. 

5. Each active node supports class-based queuing (CBQ) for outgoing packets. 
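
Requirements 2 and 3 can be illustrated with a short Python sketch. It is only one possible reading of the requirements, not part of any existing EE: the names Entry, has_access and lookup are hypothetical, and the prefix lengths are assumed because Fig. 2 lists addresses only. The sketch checks the access-rights set of requirement 2 and performs a first-match lookup in which active-code entries come first and apply to incoming packets only.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional, Set


def has_access(owner_addrs: Set[str], s: str, d: str) -> bool:
    """Requirement 2: (s, d) is in (O x I) union (I x O) and s != d."""
    return s != d and (s in owner_addrs or d in owner_addrs)


@dataclass
class Entry:
    destination: str        # destination prefix
    source: Optional[str]   # source prefix, or None for "Any"
    target: str             # active code name or next-hop address
    active: bool            # True if the entry dispatches to active code


def lookup(table: List[Entry], src: str, dst: str, incoming: bool) -> Optional[str]:
    """First-match lookup; active entries are skipped for outgoing packets."""
    for entry in table:
        if entry.active and not incoming:
            continue  # processed packets must not be dispatched again
        if ipaddress.ip_address(dst) not in ipaddress.ip_network(entry.destination):
            continue
        if entry.source is not None and \
                ipaddress.ip_address(src) not in ipaddress.ip_network(entry.source):
            continue
        return entry.target
    return None


# Entries modeled on Fig. 2 (prefix lengths are assumptions).
table = [
    Entry("1.2.3.4/32", None, "Active Code A", True),
    Entry("10.50.0.0/16", "11.12.13.14/32", "Active Code B", True),
    Entry("1.2.0.0/16", None, "29.15.20.1", False),
]

incoming_target = lookup(table, "9.9.9.9", "1.2.3.4", incoming=True)
outgoing_target = lookup(table, "9.9.9.9", "1.2.3.4", incoming=False)
```

With these entries, a packet for 1.2.3.4 is handed to Active Code A on arrival and follows the regular route to 29.15.20.1 once the active code releases it, which is how the ordering rule avoids loops.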



4 Aegis 

4.1 The Basic Concept 




Fig. 3. Conventional Firewall. 



Fig. 4. Aegis. 



Fig. 3 illustrates how a conventional stationary firewall would attempt to fend off a 
DDoS attack. Assuming that the attackers have been successfully distinguished from 
the legitimate users, the firewall may effectively free up the server's computing 
resources by blocking floods from the attackers. However, if the attackers aim to 
consume the victim's network bandwidth, congestion is likely to occur between the 
ISP edge router and the victim's firewall. Legitimate visitors will still have trouble 
accessing the server and, as a result, the attackers will have essentially achieved their 
goal. 

By exploiting the programmability of Active Networks, however, the filtering 
process can be distributed and moved to optimal locations so that the unwanted
packets can be blocked more effectively while preserving as much bandwidth as
possible. Fig. 4 illustrates this concept. Instead of attempting to block all unwanted
traffic at one fixed location, Aegis filters packets upstream in a distributed
manner. As a result, the damage caused by the attack is effectively distributed and
congestion is far less likely to occur.



4.2 Modeling DDoS Attacks 



For the purpose of describing our work, we have generalized bandwidth-consuming 
DDoS attacks into the following two models: 




[Fig. 5. Model A. Figure labels: Control Traffic; Flood traffic with spoofed source address.]

Model A: This model, illustrated in Fig. 5, includes DDoS attacks using ICMP Flood 
or UDP Flood. The "Master" represents the attacker's terminal, while the "Slaves" 
represent terminals being intruded upon and controlled by the attacker. DDoS 
software is often uploaded from the Master terminal to each compromised Slave 
terminal. Communication between the Master and the Slaves is called Control 
Traffic. In this model, each Slave terminal attempts to bombard the victim with 
massive amount of packets, sometimes with spoofed source address so that the victim 
is unable to determine the real source of the flood. Let $\mu_i$ denote the number of
packets sent per second from the $i$-th Slave, $k$ the number of Slave terminals, $s$ the size of
each malicious packet in bytes and $v$ the victim's bandwidth in Kbps. Note that while
$\mu_i$ depends on the capability of each Slave, $s$ is often fixed across different Slaves.
The attacker can effectively knock the target victim off the Internet if

$$v < \frac{8s}{1000} \sum_{i=1}^{k} \mu_i \qquad (1)$$

where the factor 8/1000 converts bytes per second to Kbps.
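
As a rough numerical illustration of the Model A condition, the following Python sketch compares the aggregate flood rate with the victim's bandwidth. It assumes that the flood rate in Kbps is the summed per-Slave packet rate multiplied by the packet size in bytes and by 8/1000; the function names and the sample figures are hypothetical and are not measurements from the paper.

```python
from typing import Sequence


def flood_kbps(rates_pps: Sequence[float], packet_bytes: int) -> float:
    """Aggregate flood rate in Kbps for per-Slave packet rates (packets/s)."""
    return 8 * packet_bytes * sum(rates_pps) / 1000


def victim_saturated(v_kbps: float, rates_pps: Sequence[float], packet_bytes: int) -> bool:
    """True if the flood exceeds the victim's bandwidth v (in Kbps)."""
    return v_kbps < flood_kbps(rates_pps, packet_bytes)


# Hypothetical attack: 100 Slaves, 50 packets/s each, 1000-byte packets.
rates = [50] * 100
total = flood_kbps(rates, 1000)              # 40000.0 Kbps
saturated = victim_saturated(1544, rates, 1000)
```

Under these assumed figures the flood reaches 40,000 Kbps, far above a 1,544 Kbps access link, so the victim would be saturated.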



Now let us represent the Internet as an undirected graph INet = (V, E), where V is
the set of network nodes and E is the set of physical links between the elements in V.
Note that V includes both end systems and routers. Let $S \subset V$ denote the set of Slave
terminals and $d \in V \setminus S$ denote the victim. Assuming that the routes are fixed, we can
represent each attack path of Model A from a Slave to the victim as

$$A_i = \langle s_i, v_{i,1}, v_{i,2}, \ldots, v_{i,n_i}, d \rangle \qquad (2)$$

Each route comprises $n_i$ routers and nodes $\{v_{i,1}, \ldots, v_{i,n_i}, d\}$. The terminal $d$
receives packets with a source address $s_i$, which may or may not be forged.



[Figure labels: Slave; Amplifying Network; Victim; Control Traffic; Unamplified flood with spoofed source address; Amplified flood with legitimate source address.]

Fig. 6. Model B.



Model B: This model, illustrated in Fig. 6, includes DDoS attacks using Smurf or 
Fraggle. Model B is characterized by the use of amplifying networks to magnify the 
damage. The attacker commands each Slave to send spoofed ICMP Echo packets or 
UDP port-7 (echo) packets to the broadcast address of some amplifying network 
(shown in Fig. 7). The source address of each Echo request is forged with the victim's 
address, so that all the systems on the amplifying network will reply to the victim. 
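The number of systems available to reply to Echo requests is often called the
amplifying ratio (denoted as $\lambda$). Using this model, the attacker can effectively knock
the target victim off the Internet if:

$$v < \frac{8 \lambda s}{1000} \sum_{i=1}^{k} \mu_i \qquad (3)$$

with $v$, $s$, $k$ and the per-Slave packet rates $\mu_i$ as in Model A.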
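
The effect of the amplifying ratio can be illustrated by asking how many Slaves an attacker needs. The Python sketch below assumes that every Slave sends at the same rate and that each Slave contributes 8 times the amplifying ratio times the packet size times the packet rate, divided by 1000, in Kbps; the function name and the sample figures are hypothetical.

```python
import math


def slaves_needed(v_kbps: float, rate_pps: float, packet_bytes: int,
                  amplification: int = 1) -> int:
    """Smallest number of Slaves whose combined flood strictly exceeds v.

    amplification = 1 corresponds to a direct flood without an
    amplifying network; larger values model the amplifying ratio.
    """
    per_slave_kbps = 8 * amplification * packet_bytes * rate_pps / 1000
    return math.floor(v_kbps / per_slave_kbps) + 1


# Hypothetical figures: 1544 Kbps victim link, 20 packets/s, 64-byte packets.
direct = slaves_needed(1544, 20, 64)                        # no amplification
amplified = slaves_needed(1544, 20, 64, amplification=100)  # ratio of 100
```

With these assumed numbers, a direct flood needs 151 Slaves while an amplifying ratio of 100 brings the requirement down to 2, which is why blocking upstream of the amplifying network matters.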



In a similar manner, we can represent each attack path from a Slave to a victim in 
Model B as: 



$$B_i = \langle s_i, v_{i,1}, v_{i,2}, \ldots, v_{i,m_i-1}, \alpha_i, v_{i,m_i+1}, \ldots, v_{i,n_i}, d \rangle \qquad (4)$$

The $m_i$-th router ($1 < m_i < n_i$) from the Slave terminal is denoted by $\alpha_i$, which represents
the edge router of an amplifying network. Each packet generated by the attacker in
the subpath $\langle s_i, v_{i,1}, v_{i,2}, \ldots, v_{i,m_i-1}, \alpha_i \rangle$ carries the spoofed source address $d$, the
address of the victim, so that the subpath appears as $\langle d, v_{i,1}, v_{i,2}, \ldots, v_{i,m_i-1}, \alpha_i \rangle$.
However, each packet originating from the amplifying network in the subpath
$\langle \alpha_i, v_{i,m_i+1}, \ldots, v_{i,n_i}, d \rangle$ carries a legitimate source address because of the way Echo
replies are generated.




4.3 Components 

The Aegis system consists of three core components: Commander, Shield and Probe. 
Fig. 8 depicts a possible deployment of the Aegis system on the Internet. 




The Internet is given as INet = (V, E), and the Active Network that satisfies our system
requirements is given as ANet = (V', E'), in which $V' \subset V$ and $E' \subset E$. Shield and Probe
modules may reside in any node $v \in V'$. Fig. 9 presents a logical view of the Aegis
system. The role of each component is described in the following sections.




4.3.1 Shield 

The Shields have been designed to act during the “first response” stage. The Shield 
modules are implemented as mobile code. They can be executed at runtime in the 
Execution Environments (EE) of Active Networks that satisfy the requirements listed 
in Section 3. Residing in an active node, each Shield module monitors the data 
packets flowing towards the protected host or network. This is possible through the
second requirement in Section 3, which entitles these modules to full control over all
packets associated with the owner of the module. Each Shield is equipped with a
traffic classifier, which determines the probability that the inbound traffic is part of a
DDoS flood. Packets with a high suspicion level are pushed into low-priority output
queues and vice versa (Fig. 10). The goals of this design are twofold: 1) to increase
the likelihood of normal traffic reaching the protected site under attack, while 
preventing suspicious traffic from doing so; and 2) to give more flexibility and 
precision to the DDoS detection algorithms. Determining whether unusual bandwidth 
consumption is caused by a DDoS attack or simply by a peak in regular traffic is not a 
trivial problem. Even if an attack is confirmed, separating the DDoS flood from the 
legitimate traffic cannot always be done with reasonable confidence. To minimize 
“collateral damage”, the detection algorithm should not yield simple true-or-false 
binary output, but rather, a rating of the likelihood that certain traffic is engaged in an 
attack. Only packets that are extensively anomalous and suspicious should be 
discarded on the spot. The traffic classifier in each Shield performs two types of 
inspection: 

Local Stateless Inspection: This type of inspection is applied to each inbound
packet based on a number of static parameters. These parameters may include
source/destination addresses, protocol values in IP headers, packet length, destination
port numbers, the ICMP type and hashes of part of the payload. For example, if the
host under protection is a web site and it is certain that the host would not send Ping
requests to any outside host, Shields should be configured to immediately discard all
inbound ICMP packets with type 0 (echo reply) and all UDP packets with source port
number 7. This policy is very effective in defending against Smurf and Fraggle
attacks. Another example of an anomaly that can be detected statelessly is a packet with
identical source and destination IP addresses. Some attacks send their target hosts
packets carrying the address of these hosts as both source and destination, causing these
hosts to go into a loop. Shields should therefore discard all packets with such an
anomaly.
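
The three stateless rules mentioned above can be expressed as a per-packet predicate. The following Python sketch is illustrative only: the Packet record and the function name are hypothetical, and a real Shield would read these fields from the packets handed to it by the EE.

```python
from dataclasses import dataclass
from typing import Optional

ICMP = 1   # IP protocol number for ICMP
UDP = 17   # IP protocol number for UDP


@dataclass
class Packet:
    src: str
    dst: str
    protocol: int
    icmp_type: Optional[int] = None   # set for ICMP packets
    src_port: Optional[int] = None    # set for UDP packets


def should_discard(p: Packet) -> bool:
    """Stateless rules for a site that never sends Ping requests."""
    if p.src == p.dst:
        return True   # identical source and destination addresses
    if p.protocol == ICMP and p.icmp_type == 0:
        return True   # unsolicited echo reply (Smurf)
    if p.protocol == UDP and p.src_port == 7:
        return True   # UDP echo reply (Fraggle)
    return False


smurf_reply = Packet("192.0.2.17", "203.0.113.5", ICMP, icmp_type=0)
web_request = Packet("198.51.100.9", "203.0.113.5", 6)  # 6 = TCP
```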




Fig. 10. Queuing Based on Suspicion Level. 

Local Stateful Inspection: This type of inspection is regularly applied to the log file 
recorded by the Shield. The log file keeps count of certain parameters of traffic over 
a certain period of time in order to facilitate the detection of anomalies. For example, 
if there is a sudden influx of maximum-sized packets coming from the same sources 
over a period of 5 minutes or so, the Shield should raise the suspicion-level of packets 
coming from these sources and push them into the low-priority output queues. 
Another powerful inspection of this type has been suggested in [19], which defines 
packets to be malicious if they are destined for a host from which too few packets are 
coming back. We can set the Shields to spot disproportional packet rates. 
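
One way a Shield might spot disproportional packet rates is to keep per-source counters and map the result to an output queue. The Python sketch below is an assumption-laden illustration: the class name, the thresholds and the two queue labels are hypothetical and are not taken from [19] or from the Aegis design.

```python
from collections import Counter


class RatioMonitor:
    """Counts packets per remote host in both directions over one window."""

    def __init__(self, max_ratio: float = 3.0, min_packets: int = 100):
        self.max_ratio = max_ratio      # tolerated inbound/outbound ratio
        self.min_packets = min_packets  # ignore sources with little traffic
        self.inbound = Counter()        # packets from a remote host to the site
        self.outbound = Counter()       # packets from the site back to that host

    def record_inbound(self, src: str) -> None:
        self.inbound[src] += 1

    def record_outbound(self, dst: str) -> None:
        self.outbound[dst] += 1

    def is_suspicious(self, src: str) -> bool:
        sent = self.inbound[src]
        if sent < self.min_packets:
            return False
        returned = max(self.outbound[src], 1)  # avoid division-style blowups
        return sent > self.max_ratio * returned

    def queue_for(self, src: str) -> str:
        """Suspicious sources go to the low-priority output queue."""
        return "low" if self.is_suspicious(src) else "high"


monitor = RatioMonitor(max_ratio=3.0, min_packets=10)
for _ in range(100):
    monitor.record_inbound("192.0.2.17")      # floods, almost nothing returned
for _ in range(2):
    monitor.record_outbound("192.0.2.17")
for _ in range(20):
    monitor.record_inbound("198.51.100.9")    # ordinary two-way exchange
for _ in range(15):
    monitor.record_outbound("198.51.100.9")
```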

Thus, by distributing Shield modules in the upper stream, we hope to achieve two 
goals: 1) to comfortably absorb most DDoS attacks and significantly decrease the 
likelihood of congestion; and 2) to allow sampling and inspection of network traffic at 
different locations. 

4.3.2 Commander 

The Commander acts as the central control center of the Aegis system and dispatches 
Shield modules outwards to the surrounding Active Nodes. The Commander 
performs the following operations: 

Shield Level Configuration: Shield level is the depth to which the Shields expand
outward into the public network. The Shields are then organized into a breadth-first
tree. In Figure 7, the Shield level is set to 2, but it could be increased dynamically
when the protected site is under heavy attack. The Commander also has exclusive
control over all dispatched Shield modules, and is able to terminate them at will.

Global Stateful Inspection: While the two types of inspection performed locally at 
each Shield can effectively block certain types of attacks, some anomalies can be 
detected only by analyzing an overall view of the network activity. In the initial stage 
of deploying Aegis, the Commander undergoes a learning stage by regularly taking a 
snapshot of incoming traffic and applying statistical algorithms to obtain a 
“fingerprint” of normal traffic patterns, which can vary from network to network 
depending on the purpose of the protected site. The Commander can then use this
fingerprint to detect anomalies, such as an unusually high degree of randomness in the
source addresses of inbound packets, which may suggest an attack with randomly
spoofed traffic. Once an anomaly is detected, the Commander can issue new filtering 
rules to all Shields. 
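
The "degree of randomness in the source addresses" could be measured in several ways; the paper does not name a statistic. The Python sketch below uses Shannon entropy purely as an assumed example, with hypothetical function names and a hypothetical tolerance, to compare a traffic snapshot against a learned baseline.

```python
import math
from collections import Counter
from typing import Sequence


def source_entropy(sources: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the source addresses in a snapshot."""
    total = len(sources)
    if total == 0:
        return 0.0
    counts = Counter(sources)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def is_anomalous(sources: Sequence[str], baseline_bits: float,
                 tolerance_bits: float = 1.0) -> bool:
    """Flag a snapshot whose entropy exceeds the learned fingerprint."""
    return source_entropy(sources) > baseline_bits + tolerance_bits


# Fingerprint learned from normal traffic: four regular clients.
normal = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
baseline = source_entropy(normal)             # 2.0 bits

# Snapshot during an attack with many distinct (spoofed) sources.
attack = [f"10.0.1.{i}" for i in range(16)]   # 4.0 bits
```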

4.3.3 Probe 

In order to minimize the damage caused by a DDoS attack, the Aegis system takes 
one step further and uses Probes, which are designed to act during the “traceback” and 
“second response” stages. If the Aegis system confirms a DDoS attack and decides to 
block the flood completely, it sends Probes towards the sources of the flood in order 
to achieve two goals: to block the flood at the uppermost stream possible, and to 
gather evidence of the attack for legal purposes. The Probes are also implemented as 
Active Network modules and behave like mobile agents that can move from hop to 
hop towards the flood sources. In the following section, we will develop a probing 
algorithm as we attempt to counter both Model A and Model B DDoS attacks. 

Countering a Model-A DDoS Attack: The source addresses of packets received 
may or may not be spoofed. Because there is no way of telling if these addresses are 
legitimate or not, we can only assume initially that all of them are spoofed. We need 
some backtrack mechanism to identify a better position to block the malicious flood 
in the upstream. A number of research efforts have been dedicated to tackling this 
problem ([5] and [6]) using packet marking. While it is possible to integrate these 
proposals into Aegis, we have taken a different approach; one that utilizes the 
flexibility offered by Active Networks. Our probing algorithm is outlined in the 
following pseudo-code. The source address of the tracked traffic found by the 
countering Shield is stored in the variable attSrc. In accordance with the second 
requirement in Section 3, we can assume that each Active Node will provide us with a
function that allows us to capture packets associated with the owner of the
module (capturePacketByDest in line 3).

Starting from the countering Shield, Probes are continuously replicated and 
dispatched to neighboring nodes in the direction of the source of the attack. After 
arriving at a new node, each Probe performs one of the following sequences of 
actions: 

1) if the traffic being tracked is found in the new node (lines 3-4): block the traffic
(line 5) → dispatch replicates to adjacent nodes except the predecessor (lines 7-9) →
self-destruct when the tracked traffic no longer arrives (lines 11, 12).

2) if the traffic being tracked is not found in the new node after a certain time:
self-destruct (lines 11, 12).

//Arriving at a new node u
1.  explored ← false
2.  do
3.      p ← capturePacketByDest(victim's IP)
4.      if p.src = attSrc then
5.          discard(p)
6.          if !explored
7.              for each v ∈ Adjacent[u]
8.                  if v ≠ predecessor[u] then
9.                      dispatch a replicate of Probe to v
10.     else releasePacket(p)
11. while (p ≠ NIL) and (!IdleTimeOut)
12. self-destruct
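
The way Probes spread through the active nodes can be simulated on a small graph. The Python sketch below is a simplified, centralized model of the distributed algorithm, not the Probe code itself: the function name, the node names and the visited set (added here to stop replicates from revisiting a node in cyclic topologies) are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def trace_back(adjacent: Dict[str, List[str]], carries_flood: Set[str],
               start: str) -> List[str]:
    """Return the nodes at which a Probe blocks the tracked traffic.

    adjacent      -- neighboring active nodes of each active node
    carries_flood -- nodes through which the tracked traffic passes
    start         -- node of the countering Shield
    """
    blocked: List[str] = []
    visited: Set[str] = set()
    frontier: deque = deque([(start, None)])  # (node, predecessor)
    while frontier:
        node, predecessor = frontier.popleft()
        if node in visited:
            continue
        visited.add(node)
        if node not in carries_flood:
            continue  # nothing captured before the idle timeout: self-destruct
        blocked.append(node)  # tracked traffic found: discard it here
        for neighbor in adjacent[node]:
            if neighbor != predecessor:
                frontier.append((neighbor, node))  # dispatch a replicate
    return blocked


# Hypothetical topology: the flood arrives via r3 -> r2 -> r1 -> shield,
# while r4 is a side branch that carries no attack traffic.
adjacent = {
    "shield": ["r1"],
    "r1": ["shield", "r2", "r4"],
    "r2": ["r1", "r3"],
    "r3": ["r2"],
    "r4": ["r1"],
}
path = trace_back(adjacent, {"shield", "r1", "r2", "r3"}, "shield")
```

In this example the last entry of the result, r3, is the node closest to the flood source, which is where the Probe would report its location to the Commander.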

Fig. 11 illustrates the progress of this traceback mechanism on a portion of the 
network. 





Fig. 11. Traceback.

Recall that in Model A, each attack path is denoted as $A_i = \langle s_i, v_{i,1}, \ldots, v_{i,n_i}, d \rangle$.
The goal of the above algorithm is to find $v_{i,\min}$ (the active node nearest the Slave) such that
$v_{i,\min} \in$ ANet, and to block all unwanted traffic there. This has the effect of freeing routers
$v_{i,\min+1}, v_{i,\min+2}, \ldots, v_{i,n_i}$ from processing packets launched by the attacker. Once the
Probe reaches $v_{i,\min}$, where no more neighboring active nodes can be found in the direction
of the flood source, it reports its location (IP address) to the Commander for the purpose of
evidence gathering.

Countering a Model-B DDoS Attack: Recall that in a Model B DDoS attack, each
attack path is denoted as $B_i = \langle s_i, v_{i,1}, v_{i,2}, \ldots, v_{i,m_i-1}, \alpha_i, v_{i,m_i+1}, \ldots, v_{i,n_i}, d \rangle$. Each packet
generated by the attacker in the subpath $\langle s_i, v_{i,1}, v_{i,2}, \ldots, v_{i,m_i-1}, \alpha_i \rangle$ carries the spoofed
source address $d$, while each packet in the subpath $\langle \alpha_i, v_{i,m_i+1}, \ldots, v_{i,n_i}, d \rangle$ carries the
legitimate source address of a host in the amplifying network. Since the Shield
would initially store in attSrc the source addresses of hosts residing in the amplifying
network, the Probes can explore only as far as $v_{i,\min}$ such that $v_{i,\min} \in$ ANet and
$m_i < \min < n_i$, using the algorithm described above. If there exists $v_{i,k}$ such that $v_{i,k} \in$ ANet and
$1 < k < m_i$, by moving our Probes into $v_{i,k}$, we can further reduce the unwanted traffic by
up to a factor of the amplifying ratio. We now need to upgrade our algorithm so that
the Probes can move beyond the amplifying network. This can be done by
modifying lines 3 and 4 of the original algorithm (see pseudo-code below). This
time, the Probe examines all packets associated with the victim's IP (i.e. packets in
which the victim's IP appears as the source or destination). In addition to the original
condition in line 4, if the packet appears to have originated from the victim and to be
destined for the broadcast address (such as 255) of the attSrc, the packet is
immediately discarded. These modifications allow the Probes to penetrate further
towards the Slave terminals.



//Arriving at a new node u
1.  explored ← false
2.  do
3.      p ← capturePacketAssociatedWith(victim's IP)
4.      if (p.src = attSrc) or ((p.src = victim's IP) and
            (p.dest = attSrc's broadcast)) then
5.          discard(p)
6.          if !explored
7.              for each v ∈ Adjacent[u]
8.                  if v ∉ predecessor[u] then
9.                      dispatch a replicate of Probe to v
10.     else releasePacket(p)
11. while (p ≠ NIL) and (!IdleTimeOut)
12. self-destruct
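
The extended test in line 4 can be written as a small predicate. This is a minimal sketch under stated assumptions: attSrc is represented here as a single CIDR prefix for the amplifying network (the paper only says it stores source addresses), the function name `should_discard` is hypothetical, and the addresses are documentation prefixes chosen for the example.

```python
import ipaddress

def should_discard(src, dst, victim_ip, att_src_net):
    """Return True if a packet matches the modified line 4 condition."""
    net = ipaddress.ip_network(att_src_net)
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    # Original condition: traffic sent by the recorded attack sources.
    from_att_src = src_ip in net
    # Added condition: a packet that claims to come from the victim and
    # targets the broadcast address of the amplifying network.
    spoofed_trigger = (src_ip == ipaddress.ip_address(victim_ip)
                       and dst_ip == net.broadcast_address)
    return from_att_src or spoofed_trigger

VICTIM = "192.0.2.10"            # assumed victim address
AMPLIFIER = "198.51.100.0/24"    # assumed amplifying network (attSrc)

reply_flood = should_discard("198.51.100.7", VICTIM, VICTIM, AMPLIFIER)
spoofed_request = should_discard(VICTIM, "198.51.100.255", VICTIM, AMPLIFIER)
unrelated = should_discard("203.0.113.5", VICTIM, VICTIM, AMPLIFIER)
```

The second case is the one that lets a Probe recognize attack traffic upstream of the amplifying network, where packets still carry the spoofed victim address.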



5 Future Work 

In our future work, we will attempt to tackle the following issues:

Dynamic Routing. In our framework, we have assumed that all attack paths are 
fixed; however, it is possible to experience route changes during an attack. We 
therefore need to consider such a scenario and incorporate it into Aegis. 

Tracking Control Traffic. Currently Aegis is able to trace back up to the Slave 
terminals. In order to facilitate legal actions, it is necessary to go beyond the Slaves 
and trace the control traffic in order to find the Master machine. This is an 
enhancement we plan to work on in the future. 

Finding/Designing a Suitable EE. We are looking for a suitable EE that meets the
requirements listed in Section 3. If none of the existing ones satisfies our needs, we
plan to design one ourselves and implement a prototype.



6 Conclusion 

In principle, Aegis is a distributed firewall system designed to defend against the
distributed nature of DDoS attacks. We have proposed an effective solution that cuts
off unwanted traffic automatically upstream, thus freeing the victim's
network from massive bandwidth consumption. Although Aegis cannot be
incorporated into the Internet in its present form to offer an immediate solution to 
DDoS attacks, we hope that we have at least demonstrated some of the new 
possibilities that Active Networks can offer.

Abstract. Grid computing is a promising way to aggregate geographically
distant machines and to allow them to work together to solve large
problems.

After studying Grid network requirements, we observe that the network
must take part in the Grid computing session to provide intelligent adaptive
transport of Grid data streams.

By proposing new intelligent dynamic services, active networks can be
the perfect companion to easily and efficiently deploy and maintain Grid
environments and applications.

This paper presents the Active Grid Architecture (A-Grid), which focuses
on adapting active networks to support Grid environments and
applications.

We focus on the benefits of active networking for the Grid in three areas:
high performance and dynamic active services, active reliable multicast,
and active quality of service.



1 Introduction 

In recent years, there has been a great deal of interest in Grid computing, which
is a promising way to aggregate geographically distant machines and to allow them
to work together to solve large problems. Most proposed Grid frameworks are
based on Internet connections and do not make any assumptions about the network.
Grid designers only assume reliable packet transport between Grid
nodes, and most of them choose the TCP/IP protocol.

But one of the main complaints of Grid designers is that networks do not
really support Grid applications.

Meanwhile, the field of active and programmable networks is rapidly expanding.
These networks allow users and network designers to easily deploy new
services that are applied to data streams. While most proposed systems deal
with adaptability, flexibility and new protocols applied to multimedia streams
(video, audio), no active network deals efficiently with Grid environments.

In this paper we try to merge the two fields by presenting the Active Grid
Architecture (A-Grid), which focuses on adapting active networks to support
Grid environments and applications. This Active Grid Architecture proposes
solutions to implement the two main kinds of Grid configuration: meta-cluster
computing and global computing. In this architecture the network takes part in
the Grid computing session by providing efficient and intelligent services
dedicated to the transport of Grid data streams.

We focus on the benefits of active networking for the Grid in three areas: high
performance and dynamic service deployment, reliable multicast, and quality of
service.

This paper reports on our experience in designing active network support
for Grid environments. First, we classify the network requirements of the Grid
according to the needs of environments and applications (Section 2). In Section 3
we propose the Active Grid Architecture. We then focus on support
for the main network requirements of the Grid: high performance transport
(Section 4), end-to-end Grid QoS services (Section 5) and reliable multicast
(Section 6). We conclude and present our future work in the last section.

2 Network Requirements for the Grid 

A distributed application running in a Grid environment requires various kinds
of data streams: Grid control streams and Grid application streams.

2.1 Grid Control Streams 

First of all, we can distinguish two basic kinds of Grid usage:

— Meta cluster computing:

A set of parallel machines or clusters is linked together over the Internet to
provide a very large parallel computing resource. Grid environments like
Globus, MOL [23], Polder or NetSolve are well designed to handle
meta-cluster computing sessions that execute long-distance parallel
applications.

We can identify several network needs for meta-clustering sessions:

• Grid environment deployment: The Grid infrastructure must be easily
deployed and managed: OS heterogeneity support, dynamic topology
reconfiguration, fault tolerance.

• Grid application deployment: Two kinds of collective communication
are needed: multicast and gather. The source code of applications is
multicast to a set of machines in order to be compiled on the target
architectures. In the case of Java-based environments, the bytecode can be
multicast to a set of machines. In the case of a homogeneous architecture,
the binaries are sent directly to distant machines. After the running
phase, the results of distributed tasks must be collected by the environment
in a gather communication operation.

• Grid support: The Grid environment must collect control data: node
synchronization, node workload information. The information exchanged
is also needed to provide high-performance communications between
nodes inside and outside the clusters.

— Global or mega-computing: These environments usually rely on thousands of
connected machines. Most of them are based on stealing computer cycles,
like Condor, Entropia, Nimrod-G or XtremWeb.

We can identify several network needs for global-computing sessions:

• Grid environment deployment: Dynamic enrollment of unused machines
must be taken into account by the environment to deploy tasks over the
mega-computer architecture.

• Grid application deployment: The Grid infrastructure must provide a
way to easily deploy and manage tasks on distant nodes. To avoid
restarting distributed tasks when a machine crashes or becomes
unusable, Grid environments propose check-pointing protocols to
dynamically re-deploy tasks on valid machines.

• Grid support: Various streams are needed to provide the
Grid environment with workload information for all subscribed
machines. Machine and network sensors are usually provided to optimize
the task mapping and to provide load balancing.

Of course, most environments work well for both kinds of Grid usage, like
Legion, Globus, Condor or Nimrod-G.



2.2 Grid Application Streams 

A Grid computing session must deal with various kinds of streams:

— Grid application input: During the running phase, distributed tasks of the
application must receive parameters, possibly coming from various geographically
distant equipment (telescopes, biological sequencing machines, ...) or
databases (disk arrays, tape silos, ...).

— Wide-area parallel processing: Most Grid applications consist of a sequential
program repeatedly executed with slightly different parameters on a set
of distributed computers. But with the emergence of high performance
backbones and networks, new kinds of truly communicating parallel applications
(with message passing libraries) will be possible on a WAN Grid.
Thus, during the running phase, distributed tasks can exchange data with
each other. Applications may need efficient point-to-point and global
communications (broadcast, multicast, gather, ...) depending on application
patterns. These communications must meet the QoS needs of the
Grid user.

— Coupled (Meta) Application: These are multi-component applications where
the components were previously executed as stand-alone applications.
Deploying such applications requires managing the heterogeneity of systems
and networks. The components need to exchange heterogeneous streams
and to respect component dependencies in pipeline communication mode.
As with WAN parallel applications, QoS and global communications must be
available for the components.

Such a great diversity of streams (in terms of message size, point-to-point or
global communications, data and control messages, ...) requires intelligence
in the network to fully support Grid requirements.



3 Active Grid Architecture 

We propose an active network architecture dedicated to the requirements of Grid
environments and Grid applications: the A-Grid architecture.

An active Grid architecture is based on a virtual topology of active network
nodes spread over programmable routers of the network. Active routers, also called
Active Nodes (AN), are deployed at the network periphery.

In contrast to approaches that deploy active routers widely, and in order to
guarantee high performance packet transport, we do not believe in the deployment
of Gigabit active routers in backbones. If we consider that the future of WAN
backbones could be based on all-optical networks, no dynamic services will be
allowed to process data packets there. So, we prefer to consider backbones as high
performance, well-sized passive networks. We concentrate active operations
only on edge routers/nodes located at the network periphery.

Active nodes are connected to each other, and each AN manages communications
for a small subset of Grid nodes. Grid data streams cross various
active nodes up to the passive backbone and then cross another set of active nodes
up to the receiver node. The A-Grid architecture is based on the Active Node
approach: programs, called services, are injected into active nodes independently
of the data stream. Active nodes apply these services to process the packets of
data streams. Services are deployed on demand when streams arrive at an active node.

3.1 Active Grid Architecture 

To support most Grid applications, the Active Grid architecture must deal
with the two main Grid configurations:

— Meta cluster computing (Fig. 1):

In this highly coupled configuration, an active node is placed at the network
head of each cluster or parallel machine. This node manages all data streams
entering or leaving a cluster. All active nodes are linked with other ANs
located at the backbone periphery. An active node delivers data streams to
each node of a cluster and can aggregate output streams to other clusters
of the Grid.

— Global or Mega computing (Fig. 2):

In this loosely coupled configuration, an AN can be associated with each
Grid node or can manage a set of aggregated Grid nodes. Hierarchies of
active nodes can be deployed at each network heterogeneity point.

Each AN manages all operations and data streams coming to Grid nodes:
subscription of volunteer machines, result gathering, node synchronization
and check-pointing.

Fig. 1. Meta Cluster Computing Active Grid Architecture.

Fig. 2. Global Computing Active Grid Architecture.

For both configurations, active nodes will manage the Grid environment
by deploying dedicated services adapted to Grid requirements: management of
node mobility, dynamic topology reconfiguration, fault tolerance.

3.2 Active Network Benefits for Grid Applications 

Using an Active Grid architecture can improve support for the communication needs
of Grid applications:

— Application deployment: To deploy applications efficiently, active reliable
multicast protocols are needed to optimize source code or binary deployment
and task mapping on the Grid configuration, according to
resource managers and load-balancing tools. An active multicast will reduce
the traffic generated by transporting applications (source code, binaries,
bytecode, ...) by minimizing the number of messages in the network. Active nodes
will deploy dedicated multicast protocols and guarantee the reliability of
deployment by using the storage capabilities of active nodes.

— Grid support: The Active architecture can provide information to the Grid
framework about network state and task mapping. Active nodes must be
open and easily coupled with all Grid environment requirements. Active
nodes will implement permanent Grid support services to generate control
streams between the active network layer and the Grid environment.

— Wide-area parallel processing: With the emergence of Grid parallel
applications, tasks will need to communicate by sending computing data streams
with QoS requests. The A-Grid architecture must also guarantee efficient
data transport to minimize the software latency of communications. Active
nodes deploy dynamic services to handle data streams: QoS, data compression,
“on the fly” data aggregation.

— Coupled (Meta) Application: The Active architecture must provide heterogeneous
services applied to data streams (data conversion services, ...).
End-to-end QoS dynamic services will be deployed on active nodes to guarantee
efficient data transport (in terms of delay and bandwidth).

Most of the services needed by Grid environments (high performance transport,
dynamic topology adaptation, QoS, on-the-fly data compression, data encryption,
data multicast, data conversion, error management) must be easily and efficiently
deployable on demand on an Active Grid architecture. To allow efficient and
portable service deployment, we present in the next section our approach to
an active network framework that can easily be merged with a Grid environment:
the Tamanoir framework. Then, to address the main network requirements of the Grid
identified in the previous section, we focus on the two major service
needs: QoS and reliable multicast.

4 High Performance and Dynamic Service Deployment 

We explore the design of an intelligent network by proposing a new active network
framework dedicated to high performance active networking. The Tamanoir¹
framework is a high performance prototype active environment based on
active edge routers. Active services can be easily deployed in the network and
are adapted to the requirements of the architecture, users and service providers.

A set of distributed tools is provided: routing manager, active node and
stream monitoring, web-based services library. Tamanoir is based on compiled
Java (GCJ) with a multi-threaded approach to combine performance and
portability of services. Applications can easily benefit from personalized network
services through the injection of Java code.

¹ Tamanoir (great anteater) is one of the strangest animals of South America; it
eats only ants (30,000 daily). We chose this animal in reference to the well-known
active network system ANTS [27].

4.1 Overview of a Tamanoir Node 

An active node is a router which can receive data packets, process them and
forward them to other active nodes.

A Tamanoir Active Node (TAN) is a persistent daemon acting as a dynamically
programmable router. Once deployed on a node, it is linked to its neighbors in the
active architecture. A TAN receives and sends data packets after processing
them with user services. A TAN is also in charge of deploying and applying
services to packets depending on application requirements. When a packet arrives
at a Tamanoir daemon, it is forwarded to the service manager (Fig. 3). The
packet is then processed by a service in a dedicated thread. The resulting packet
is then forwarded to the next active node or to the receiving part of the application
according to the routing tables maintained in the TAN.
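
The packet path through a node described above can be mocked up in a few lines. Tamanoir itself is written in Java; the following Python sketch is only a toy model under stated assumptions: the class `ActiveNode`, the packet fields and the `upper` service are hypothetical, and a thread pool stands in for the dedicated service thread.

```python
from concurrent.futures import ThreadPoolExecutor

class ActiveNode:
    """Toy model of a TAN: a service manager plus a routing table."""

    def __init__(self, name, routes):
        self.name = name
        self.routes = routes            # destination -> next active node
        self.services = {}              # service name -> callable(payload)
        self.pool = ThreadPoolExecutor(max_workers=4)

    def register(self, service_name, func):
        self.services[service_name] = func

    def handle(self, packet):
        """Run the packet's service in a worker thread, then pick the next hop."""
        service = self.services[packet["service"]]
        future = self.pool.submit(service, packet["payload"])
        processed = dict(packet, payload=future.result())
        # With no route entry, deliver directly to the receiving application.
        next_hop = self.routes.get(packet["dest"], packet["dest"])
        return next_hop, processed

node = ActiveNode("tan-a", routes={"host-b": "tan-b"})
node.register("upper", str.upper)
next_hop, out = node.handle(
    {"service": "upper", "dest": "host-b", "payload": "grid data"}
)
node.pool.shutdown()
```

Here the packet for `host-b` is processed by the `upper` service and then handed to `tan-b`, the next active node listed in the routing table.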



Fig. 3. TAN: Tamanoir Active Node. [Figure label: routing table of active nodes.]



4.2 Dynamic Service Deployment 

In Tamanoir, a service is a Java class containing a minimal number of methods with
a fixed signature (recv() and send()) to receive a packet, apply code to it and send
the packet to another TAN, to the receiving application, or even to several
destinations in the context of a multicast service. Actually, each service used by an application is



Active Networking Support for the Grid 






inherited from a generic class simply called Service. We have used this
technique in order to simplify the design of future services and especially to
allow the TAN to download a new class dynamically.

In each packet we find a label (or tag) identifying the last TAN
crossed by the packet. Therefore, if a TAN does not hold the appropriate service,
it can perform a download operation.

In Fig. 4 we can observe three kinds of service deployment. The first TAN
crossed by a packet can download the required service either from the transmitting
application or from a service broker. When the service name contains an http
address, the TAN contacts the web service broker, so applications can use generic
Tamanoir services when they do not need personalized ones. Subsequent TANs
download the service from a previous TAN crossed by the packet or from the service
broker.
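
The choice of download source can be sketched as a lookup over the possible origins of a service. The sketch below is an assumption-laden illustration, not Tamanoir code: the function `locate_service`, the use of dictionaries as stand-ins for remote sources, and the search order (previous TAN, then sender, then broker) are all hypothetical; the text only states which sources are available to which node.

```python
def locate_service(name, local_cache, previous_tan, sender, broker):
    """Return (source, code) for a service, fetching it if it is not cached.

    previous_tan, sender and broker are dicts mapping name -> code, or None
    when that source is not reachable from this node.
    """
    if name in local_cache:
        return "cache", local_cache[name]
    if name.startswith("http://"):
        order = [("broker", broker)]    # an http name points at the broker
    else:
        order = [("previous TAN", previous_tan),
                 ("sender", sender),
                 ("broker", broker)]
    for label, store in order:
        if store is not None and name in store:
            local_cache[name] = store[name]   # keep it for later packets
            return label, store[name]
    raise LookupError(f"service {name!r} not available")

# First TAN on the path: no previous TAN, so the sender supplies the class.
cache = {}
broker = {"http://broker.example/compress": "code-c"}
first = locate_service("fec", cache, None, {"fec": "code-f"}, broker)
second = locate_service("fec", cache, None, None, broker)
generic = locate_service("http://broker.example/compress", {}, None, None, broker)
```

Once a service has been fetched it stays in the node's cache, so only the first packet of a stream pays the download cost.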




Fig. 4. Dynamic Service Deployment. [Figure labels: (1) service demand loading,
(2) service transmission.]



4.3 Experiments 

We ran our first experiments with Tamanoir on 350 MHz Pentium II machines linked by
Fast Ethernet switches and compared the Tamanoir system with ANTS [27], the most
developed active network system.

The results presented in Fig. 5 show the delay needed to cross an active node
(latency). While ANTS needs 3 ms and depends on the capsule payload size,
the Tamanoir time remains constant, with a latency of 750 µs. Meanwhile, the
processing capability of ANTS remains weak, while Tamanoir is 3 times faster.

These first experiments show that the Tamanoir framework can support a Grid
environment without adding too much latency to data streams. So Tamanoir
can efficiently deploy services on active nodes depending on Grid requirements:
QoS, data conversion, multicast, “on the fly” data compression. The next sections
focus on two main kinds of services: QoS and multicast.

Fig. 5. Latency: Cost to Cross an Active Node. [Figure title: ANTS/Tamanoir latency.]



5 Active Grid Quality of Service 

In this part, we focus on the QoS problem for Grid flows and present
the opportunities that active network technology offers for the QoS management
and control of this type of stream. We study in particular how the solutions
proposed for multimedia applications can be adapted to meet the specific
requirements of Grid applications.

5.1 What Is Quality of Service? 

Quality of Service (QoS) represents the set of quantitative and qualitative
characteristics necessary to achieve the required functionality of an application.
In the network community, QoS is a set of tools and standards that gives network
managers the ability to control the mix of bandwidth, delay, variance in delay
(jitter) and packet loss. Controlling this mix makes it possible to provide better
and more predictable network service.

The problem of QoS has arisen in the Internet since it became the common
infrastructure for a variety of new applications with various requirements for
QoS guarantees. The traditional best-effort model was not designed to
support time-sensitive and heterogeneous traffic. The emergence of multimedia
applications requires new QoS solutions.

The first step is to introduce the capabilities required to support QoS in the
Internet infrastructure and to develop mechanisms and algorithms that scale
while enabling a wide range of QoS guarantees. The second step is to enable
users and applications to access these new capabilities. This last task is quite
difficult and can be considered one of the main reasons for the relatively slow
deployment of QoS in IP networks.

Three types of QoS guarantee (or end-to-end QoS level) can be offered by an
IP network: best-effort, statistical guarantees or strict (absolute) guarantees. To
obtain the required end-to-end QoS, different approaches are possible. They can
be divided into two groups: pure in-network mechanisms and end-to-end
mechanisms. Pure in-network mechanisms are based on resource reservation
or on class-based services (like IntServ or DiffServ). The main problem
of the IntServ architecture is scalability: this solution requires that control
and forwarding state for all flows be maintained in routers. The DiffServ
architecture appears to be more scalable and manageable. It focuses
not on individual flows but on traffic aggregates, large sets of flows with
similar service requirements. The goals of service differentiation are to
accommodate heterogeneous application requirements and user expectations and to
permit differentiated pricing of Internet service. The DiffServ model requires that
complex classification and conditioning functions be implemented only at network
boundary nodes, and that per-hop behaviors be applied to aggregates of traffic
which have been appropriately marked using the DS field in the IP header.
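
To make the marking step concrete, the sketch below packs and unpacks a DiffServ code point in the DS byte: the code point occupies the six high-order bits, and the two low-order bits are the ones used for ECN. The helper names and the small per-hop behavior table are illustrative assumptions; only the bit layout and the code point 46 for expedited forwarding come from the DiffServ standards.

```python
def dscp_to_ds_byte(dscp, ecn=0):
    """Pack a 6-bit DSCP and a 2-bit ECN value into the 8-bit DS byte."""
    if not 0 <= dscp < 64 or not 0 <= ecn < 4:
        raise ValueError("DSCP must fit in 6 bits and ECN in 2 bits")
    return (dscp << 2) | ecn

def ds_byte_to_dscp(ds_byte):
    """Extract the DSCP from a DS byte, ignoring the two low-order bits."""
    return (ds_byte >> 2) & 0x3F

# Illustrative per-hop behavior table: core routers only look up the DSCP,
# while the classification that chose it happened once at the boundary node.
PHB = {46: "expedited forwarding", 0: "default (best effort)"}

ds_byte = dscp_to_ds_byte(46)              # mark at the boundary node
behavior = PHB[ds_byte_to_dscp(ds_byte)]   # look up in a core router
```

The core router does a single table lookup per packet, which is why this model keeps per-flow state out of the backbone.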

Until now, neither IntServ nor DiffServ seems to offer a single solution for
all the various requirements. The aim of end-to-end QoS mechanisms is to
mask the deficiencies of the network QoS. For classical data transmission on a
best-effort IP network, the role of the end-to-end TCP protocol is error control
and error recovery by retransmission of lost packets. For multimedia applications,
QoS mechanisms for adaptability (such as forward error correction
(FEC)) are incorporated in the adaptive application itself and not in the
transport protocol (RTP). The advantage of this adaptive approach is that the
application monitors the QoS it experiences, and can detect variations and react
appropriately.



5.2 Grid QoS 

The QoS performance requirements of Grid streams are more disparate and flexible
than those of multimedia applications. In the traditional QoS approach, the stream
specification includes quantitative parameters that can be classified into
performance parameters (bit rate), temporal characteristics (delay, jitter (delay
variation)) and integrity parameters (loss rate and error rate).

If we suppose that some kind of QoS will be available in the Internet in the near
future, one can ask whether Grid applications will benefit from the currently
proposed QoS services. For example, the Grid community is interested in a
guarantee on the delivery of a complete bulk data file, but not in the priority
of each individual packet. This service differs from traditional QoS offerings
in that the user specifies the ultimate delivery time by which the data transfer
must complete. To ensure that the transfer completes on time, it is necessary
to determine when the transfer should start and to control the transfer of the
individual packets.
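
The start-time calculation implied here is simple arithmetic: the transfer must begin no later than the deadline minus the expected transfer duration. The sketch below assumes a constant achievable throughput and adds an optional safety margin; the function name `latest_start` and the example figures are hypothetical.

```python
def latest_start(deadline_s, size_bytes, throughput_bps, margin=0.0):
    """Latest start time (in seconds) so the transfer ends by the deadline.

    margin is a fractional allowance on the estimated duration, e.g. 0.25
    reserves 25% extra time for throughput fluctuations.
    """
    duration_s = size_bytes * 8 / throughput_bps
    return deadline_s - duration_s * (1.0 + margin)

# A 1 GB file over a path sustaining 100 Mbit/s takes 80 s, so with a
# deadline one hour from now the transfer can start at t = 3520 s at the latest.
start_exact = latest_start(3600, 1_000_000_000, 100_000_000)
start_safe = latest_start(3600, 1_000_000_000, 100_000_000, margin=0.25)
```

In practice the throughput estimate would come from network measurements, which is why the two concerns are linked in the text.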

Another Grid QoS service should, for example, provide information about
the achievable throughput and about the stability level of data delivery between
two points in the network. Network and throughput measurements are central
to the Grid QoS problem.

Another key component for the Grid is the dynamic generation of performance
forecasts. For example, the Network Weather Service (NWS) periodically
takes measurements of the load of the resources it has to monitor and uses
them to generate performance forecasts. One of the NWS sensors is the network
sensor, whose aim is to take measurements that represent the network quality
in terms of latency and bandwidth. The different components of the NWS are
distributed on monitored hosts.
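
A forecast of this kind can be as simple as a sliding-window average over recent measurements. The sketch below is only an illustration and does not reproduce the forecasting methods of the NWS; the class name `WindowForecaster`, the window size and the sample values are assumptions.

```python
from collections import deque

class WindowForecaster:
    """Predict the next measurement as the mean of the last `window` samples."""

    def __init__(self, window=4):
        self.samples = deque(maxlen=window)   # old samples fall out automatically

    def observe(self, value):
        self.samples.append(value)

    def forecast(self):
        if not self.samples:
            raise ValueError("no measurements yet")
        return sum(self.samples) / len(self.samples)

# Periodic bandwidth measurements in Mbit/s from a hypothetical network sensor.
forecaster = WindowForecaster(window=4)
for measurement in (10, 20, 30, 40, 50):
    forecaster.observe(measurement)
predicted = forecaster.forecast()   # mean of the last four samples
```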

GARA (Globus Architecture for Reservation and Allocation) provides
advance reservations and end-to-end management for quality of service on
different types of resources (network, storage and computing). This architecture
remains a traditional solution, in the sense that the processing tasks associated
with transport (reservation and adaptation) are performed on the end
systems.

5.3 The Active Grid QoS Approach 

In our active Grid QoS approach, we study how to implement new services in the
active edge nodes. We propose an active QoS model that combines the advantages
of IntServ, DiffServ and end-to-end adaptation mechanisms. Our active QoS
approach makes it possible to:

— enlarge the QoS tools spectrum by processing on the individual flows, 

— maintain a scalable QoS approach like DiffServ in the core network, 

— realize a dynamic and efficient adaptation at the edge according to the real 
state of the network. 

Since end-to-end advance reservation of network resources is impossible on the
Internet, we argue that Grid flows require dynamic and specific adaptation.
For this, an active Grid QoS service should give the user the ability
to characterize a flow in terms of end-to-end delay or end-to-end loss rate.

It is also necessary to know the relative importance of a packet in order
to know what to do with it under different network conditions: dropping, slowing,
storing, duplicating. In congested nodes, time-constrained flows must be
given priority. The active nodes can have a finer view of the individual data
streams, and can react immediately to congestion and implement appropriate
packet discard for each stream.

In the Tamanoir architecture, capsules are transported. Data capsules can
carry different types of information that can be used for processing along the
way:

— semantics from the application (type of payload, end-to-end target QoS
performance). This information can be interpreted as an Active DiffServ code
point (ADSCP). This code point is analogous to the DSCP of the traditional
DiffServ model but can be application specific. This information characterizes
the flow with high-level, end-to-end information which is application
specific and easier to handle than token bucket specifications in a resource
reservation protocol like RSVP,

— self transfer monitoring information (e.g. cumulative transfer time),

— state of the routers already crossed (heavily loaded, congested, etc.). This
information, similar to an ECN (Explicit Congestion Notification), can be
processed by the active nodes on the way.
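
The three kinds of information listed above could be carried in a capsule header along the following lines. This is a sketch under stated assumptions, not the Tamanoir capsule format: the class `Capsule`, its field names and the string-valued code point are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Capsule:
    payload: bytes
    adscp: str                      # application-level code point, e.g. "control"
    target_delay_ms: float          # end-to-end delay requested by the application
    elapsed_ms: float = 0.0         # cumulative transfer time so far
    router_states: list = field(default_factory=list)   # (node, state) pairs

    def cross(self, node_name, hop_delay_ms, congested=False):
        """Update the monitoring fields when the capsule crosses a node."""
        self.elapsed_ms += hop_delay_ms
        state = "congested" if congested else "ok"
        self.router_states.append((node_name, state))

    def late(self):
        """True once the capsule has exceeded its end-to-end delay target."""
        return self.elapsed_ms > self.target_delay_ms

capsule = Capsule(payload=b"workload report", adscp="control", target_delay_ms=50.0)
capsule.cross("tan-a", 20.0)
capsule.cross("tan-b", 40.0, congested=True)
```

A downstream active node can then read `elapsed_ms` and `router_states` to decide how to treat the capsule, without any per-flow state of its own.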

QoS monitoring active services are associated with this QoS model and indicate
the state of the nodes and the network performance between two active nodes.

On Tamanoir we have developed several prototypes of active QoS services: 

— an active QoS adaptation service, 

— an active DiffServ service, 

— an active monitoring service. 

These prototypes were built to demonstrate the ability of the Tamanoir
approach to provide intelligent QoS mechanisms on a classical best-effort IP
network or on a DiffServ IP network. They have been validated on our local
platform.

The active QoS services we propose are able to modify the carried data
during their travel in the network. A QoS adaptation can be made “on the
fly”. This adaptation is a function of the performance experienced by the packets
of each particular flow. QoS adaptation means dropping and filtering operations
(dynamic rate shaping, QoS filters) but also data staging. QoS signaling, such as
informing the user or end application of degradation, is another task performed by
the active agents. This adaptive approach is more efficient than the traditional
adaptive application philosophy, which regulates the flow according to reports
from the receiver. With receiver reports, the latency of the reaction to congestion
can be large, and the consequences can be severe, especially if the application
throughput is very high. In an active approach, the overload situation can be
anticipated by active QoS monitoring, and the QoS adaptation, located at the active
edge router, is made closer to the overloaded router. The reaction to congestion is
then faster and the global QoS is improved.
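
The effect of reaction latency can be quantified with a short worked example: the amount of data sent towards a congested router before the rate is adjusted is the throughput multiplied by the reaction delay. The figures below (1 Gbit/s, 100 ms for an end-to-end receiver report, 5 ms for an edge node) are illustrative assumptions, not measurements from the paper.

```python
def exposed_bytes(throughput_bps, reaction_delay_ms):
    """Bytes sent into the congestion before the adaptation takes effect."""
    # bits/s * ms / 1000 gives bits; dividing by 8 gives bytes.
    return throughput_bps * reaction_delay_ms // 8000

end_to_end = exposed_bytes(1_000_000_000, 100)   # receiver-driven adaptation
at_edge = exposed_bytes(1_000_000_000, 5)        # adaptation at the edge node
```

Under these assumptions the receiver-driven loop lets 12.5 MB reach the congested router, against 625 KB when the adaptation is made at the edge.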

When the Grid architecture is deployed, specific services are downloaded
and activated in the active edge nodes. Some are QoS monitoring agents responsible
for measuring the QoS parameters. Other services, QoS adapters, intercept
and process the flows when necessary. The agents are able to exchange reports
and to communicate with hosts.

Grid QoS services based on bandwidth requirements mainly concern application
deployment and the transfer of large parameter sets between nodes. Delay-based QoS
services concern Grid control streams (workload, fault tolerance, etc.) and
the data streams of pipelined coupled applications.

6 Active Reliable Multicast for the Grids 

6.1 Reliable Multicast for the Grid 

Multicast is the process of sending every single packet to multiple destinations.
The motivation behind multicast facilities is to handle one-to-many communications
in a wide-area network with the lowest network and end-system overheads.
In contrast to best-effort multicast, which typically tolerates some data loss and
is more suited to real-time audio or video, for instance, reliable multicast
requires that all packets be safely delivered to the destinations. Desirable features
of reliable multicast include, in addition to reliability, low end-to-end delays,
high throughput and scalability.

These characteristics fit the grid computing community well, as communications
in a grid make intensive use of data distribution and collective
operations. In a very simple grid session, an initiator sends data and control
programs to a pool of computing resources, waits for some results, iterates this
process several times and eventually ends the session. The finer the computational
grain, the lower the end-to-end transmission delay must be kept.
It is also desirable to minimize the overhead at the source, since it may need to
gather results and build data for the next computing step. More complex sessions
put higher demands on network resources and on the multicast/broadcast
communication facilities (cooperation among the receivers, receivers acting as
sources for the other receivers, ...).

Meeting the objectives of reliable multicast is not an easy task. In the past,
a number of reliable multicast protocols have been proposed that
rely on complex exchanges of feedback messages (ACK or NACK).
These multicast protocols usually take an end-to-end approach to loss
recovery. Most of them fall into one of the following classes: sender-initiated,
receiver-initiated, and receiver-initiated with local recovery. In sender-initiated
protocols, the sender is responsible for both loss detection and
recovery. These protocols do not scale well to a large number of receivers
because of the ACK implosion problem at the source. Receiver-initiated protocols
move the loss detection responsibility to the receivers and use NACKs instead
of ACKs. However, they still suffer from the NACK implosion problem when a
large number of receivers have subscribed to the multicast session. In
receiver-initiated protocols with local recovery, the retransmission of a lost packet can be
performed by any receiver in the neighborhood or by a designated receiver
in a hierarchical structure. None of the above schemes provides an exact
solution to all the loss recovery problems. This is mainly due to the lack of
topology information at the end hosts.

In this section on multicast protocols, we show the benefits a computing
grid can draw from an underlying active reliable multicast (ARM) service by
comparing the performance (mainly the achievable throughput) of several active
mechanisms with that of the non-active case.

6.2 Active Reliable Multicast Explained 

In active networking, routers themselves play an active role by executing
application-dependent functions on incoming packets. Recently, the use of active
network concepts, where routers themselves contribute to enhancing the
network services through customized functionalities, has been proposed in the
multicast research community [20]. Active services for ARM mainly address
the feedback implosion problem, retransmission scoping and data caching. New
ARM protocols open new perspectives for achieving high throughput and low
latency on wide-area networks. For instance, caching data packets allows
local recovery of lost packets and reduces the recovery latency. Global or local
suppression of NACKs reduces the NACK implosion problem, and the subcast
of repair packets to only a subset of receivers limits both the retransmission scope
and the bandwidth usage, thus improving scalability.
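
The three active services named above (data caching, NACK suppression and
subcast) can be illustrated with a small Python sketch. It is a toy model written
for this text, not the implementation of any of the protocols compared below:
the class ActiveRouter, its method names and the returned action labels are all
hypothetical.

```python
class ActiveRouter:
    """Toy model of one active router serving a local group of receivers.

    Hypothetical illustration only: names and behavior are assumptions,
    not taken from a real ARM protocol implementation.
    """

    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.cache = {}            # seq -> payload (data cache)
        self.pending = {}          # seq -> receivers waiting for a repair
        self.forwarded_nacks = []  # NACKs actually sent upstream

    def on_nack(self, seq, receiver):
        """Handle a NACK coming from a receiver of the local group."""
        if seq in self.cache:
            # Local recovery: repair from the cache, nothing goes upstream.
            return ("repair", [receiver])
        first_request = seq not in self.pending
        self.pending.setdefault(seq, set()).add(receiver)
        if first_request:
            # Only the first NACK for a given packet is forwarded upstream.
            self.forwarded_nacks.append(seq)
            return ("forward", [])
        # Later NACKs for the same packet are suppressed.
        return ("suppress", [])

    def on_data(self, seq, payload):
        """Cache a packet and return the receivers it must be delivered to."""
        self.cache[seq] = payload
        if seq in self.pending:
            # Subcast: a repair goes only to the receivers that asked for it.
            return sorted(self.pending.pop(seq))
        # Ordinary data packet: deliver to the whole local group.
        return sorted(self.receivers)


if __name__ == "__main__":
    router = ActiveRouter(["r1", "r2", "r3"])
    print(router.on_nack(5, "r1"))   # first NACK for packet 5
    print(router.on_nack(5, "r2"))   # duplicate NACK for packet 5
    print(router.on_data(5, b"x"))   # repair packet arrives from upstream
    print(router.on_nack(5, "r3"))   # late NACK served from the cache
```

In this sketch the source sees a single NACK per lost packet regardless of the
local group size, which is the property that addresses feedback implosion.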

Designing an efficient ARM protocol is not an easy task, and difficult design
choices must be made. In order to demonstrate the benefit of ARM on a computing
grid, we compare 3 generic protocols denoted S1, S2 and S3. S1 uses
the global suppression of NACK packets within active routers, whereas S2 uses
the local NACK suppression strategy (the receivers wait for a random amount
of time prior to sending a NACK to the source). Finally, S3
is similar to S1 in performing global NACK suppression, but also
implements the subcast service within active routers in addition to the NACK
suppression service. The next subsection presents some performance results using
these notations. A full version of the results can be found in [21].

6.3 Performance of Active Reliable Multicast 

In the following scenario, we assume that the computing resources are distributed
across an Internet-based network with a high-speed backbone network
in the core (typically the one provided by the telecommunication companies)
and several access networks at the edge that are lower-speed (up to 1 Gbit/s) with
respect to the throughput range found in the backbone. Our test scenario involves
an initiator (source) and a pool of computing resources (receivers), where
communications from the source to the receivers are multicast. We
call source link the set of point-to-point links and traditional routers that
connects the source to the core network. Similarly, a tail link is composed of
point-to-point links and routers connecting a receiver to the core network. Active
routers are associated with the tail links (the low- to medium-performance
Internet links). However, it is possible that not all routers implement active
services. Each active router A_i is responsible for B receivers R_i1, ..., R_iB forming a
local group. A receiver associated with an active router is said to be linked. The other
receivers are said to be free. Figure 6 depicts the test scenario.

Figure 7 plots the ratio of the linked receivers' and active routers' throughput as a
function of the loss probability for S2 and S3. This figure illustrates the benefit
of global NACK suppression when several local group sizes are defined. For
reasonable loss probabilities, S3 performs better than S2 at the linked receivers'
end. This is because the linked receivers under S3 benefit from the subcast
service. In S3, a linked receiver receives a data packet only once, in contrast with
S2, where a linked receiver could receive more than one copy of the same data
packet. Moreover, in S2, a linked receiver can continue to receive NACKs from
its active router every time a receiver in its local group has experienced a loss.

Fig. 6. A Simple Grid Session Model. Fig. 7. Benefit of Global Suppression.

The subcast facility has the advantage of unloading the receivers and/or
the active routers, depending on whether the subcast is performed from the
source or not. To see the benefit of performing the subcast from the active routers
associated with the linked receivers, the corresponding figure plots the throughput
ratio at a linked receiver in S3 and S1. We can see that the subcast permits a higher
throughput at the linked receivers in S3. The gain obtained with the subcast depends on the
local group size and the loss rate. These two parameters give an idea of the
number of receivers that have experienced a loss. Therefore, it is very beneficial
to perform the subcast when the local group size is large (large-scale distributed
computing).

A further figure shows the impact of the active router density on a protocol’s
performance in terms of the overall throughput. The figure plots the overall throughput
gain as the number of active routers is increased, compared to the case with no active
routers. We have N = 100 and 1000 end receivers. The number of
active routers is varied on the x-axis, and the y-axis shows the throughput ratio
compared to a non-active solution. Several multiplying factors are applied to the
active routers’ processing power (for instance, 0.1 means 10 times
slower). We can see that with the same processing time at the active routers
and the receivers, the overall throughput can be an order of magnitude higher
if all the receivers are linked. Most interestingly, if the active routers’ processing
power is divided by 10 in S3, we can still double the overall throughput provided
that 55% of the routers are active. Most predictions assume that active router
processing power will be 5 or 10 times greater in the near future. However,
even if an active router is overloaded and exhibits less processing power
than simple receivers, active services still provide better performance than the
non-active case if the density of active routers is increased.

7 Conclusion and Future Works 

We have studied the Grid computation models in order to determine the main 
Grid network requirements in terms of efficiency, portability and ease of deploy- 
ment. 

We then studied a solution to these problems: the active networking
approach, in which all network protocols can be moved into the network in a
way that is transparent to the Grid designer and the Grid user. All communication
protocols required by the Grid (multicast, dynamic topology management, QoS,
data conversion, “on the fly” data compression, etc.) can be implemented as active
services deployed on demand in active nodes.

We explored in particular how active networking provides an elegant solution
that can efficiently handle the QoS and multicast services required by Grid
environments and Grid applications.

We proposed such an active network support, the Tamanoir framework, and
studied active QoS and reliable multicast services on top of it. The first results
are promising and should lead to major improvements in the behavior of the Grid
once the A-Grid support is deployed.

By proposing new intelligent services, active networks can be the perfect
companion for easily and efficiently deploying and maintaining Grid environments
and applications.

The next step will consist of merging the Tamanoir framework with a Globus Grid
environment. We are also currently adding active storage protocols by including
in the Tamanoir framework the distributed storage facilities provided by the Internet
Backplane Protocol software (IBP). This distributed storage facility in the
network will help us to implement an active reliable multicast service on top of
the Tamanoir environment. We have seen that with active network technology,
it is possible to efficiently transfer the QoS management and control functions
into the network. New active Grid QoS services will be proposed to allow active
nodes to adjust Grid streams depending on QoS requirements.



Abstract. To allow end users to define network behavior and to make
communication network evolution service-driven, we propose a fast and secure
capsule processing environment, a dedicated CPU architecture named
StreamCode, and related mechanisms. In our proposed architecture each
packet contains a program for the packet written in StreamCode binary
instructions. All computational resources for packet processing are isolated
for each packet and are cleared when the execution time limit expires.
This prevents malicious programs from affecting other packets, without virtual
machines. Middleware attaches appropriate StreamCode programs according
to the user requirements. A sample application, contents-sensitive
multicast on our StreamCode emulator, in which a streaming application
on a server attaches an appropriate multicast program to each content,
optimizes packet loss in multicast on a per-user and per-content basis.



1 Introduction 

The end-to-end argument pointed out that some functions must be placed on end
user terminals, whether or not equivalent functions are implemented on network
nodes. It became a fundamental design principle of the Internet Protocol (IP),
and gave us the freedom to improve communication functions, e.g., the
sophistication of the TCP congestion control mechanism or the introduction of URLs, a new
addressing scheme, without modifying routers. This freedom made the Internet
application-friendly, increased the innovation speed of applications and contributed
to the success of the Internet.

Strange to say, at the very center of this application-oriented network lies a
non-service-oriented mechanism: IP. The structure of IP is so simple that it is
difficult to introduce application-oriented functions such as QoS or multicast.
For QoS, some users require a strict guarantee, others require a loose guarantee, and
still others require no guarantee. Some applications allow partial degradation
in multicast, others don’t. This diversity may exist even within a single application.
The simplicity and uniformity of IP are not well suited to handling such a wide range
of requirements.

Capsule-type active network technology has the potential to be an ideal
solution for such requirements. The granularity at which network behavior is defined
is the packet, small enough for any application. The definition is done by end users
(or programmers of end-user applications), so it satisfies the philosophy of
the end-to-end argument: give freedom to end users and make network evolution
application-driven. We therefore believe that capsules should be the basis of the
next-generation IP or network layer protocol.

However, traditional capsules are too slow to replace IP. ANTS and PLAN
use virtual machines to achieve security, and thus can achieve only several tens
of Mbps of throughput. SNAP developed a sophisticated language to achieve
security as well as high performance. Loops and backward jumps are prohibited,
so the maximum memory consumption and the execution time of a program
can be calculated from the program length before execution. The instructions of
the language are simple, and it may be possible to develop dedicated processors to
achieve higher performance. However, there are two difficulties in this approach.
The first is that it uses a stack to handle variables. This increases the number
of memory accesses, the bottleneck in today’s computers. The second is that
the language design policy cannot be applied when we access resources on a
node, e.g., states in memory. We may therefore have to give up on using states
in SNAP packets, which limits the applications.

One common approach to achieving high performance is to develop dedicated
hardware. We therefore decided to design a capsule system optimized for hardware
execution as much as possible, especially its security mechanisms. We propose
a processor architecture (instruction set and memory management functions)
named StreamCode, in which the program executions of individual packets are separated
on the basis of time; the processor terminates the execution when a certain time
limit expires. We hope to create a multi-gigabit throughput capsule network
with this architecture in the future while keeping the security necessary for public
networks.

Hardware-oriented approaches already exist in the active network area, e.g.,
Protocol Booster [7] and ANN. In Protocol Booster, programmable packet
encapsulation is boosted by using FPGAs, and in ANN, packet classification
between active and traditional packets is accelerated by a chip named APIC.
However, they do not give end users the freedom to define network functions,
which is what we want to achieve with active network technologies.

This paper is organized as follows. In section 2 we describe the basic design
concept and the structure of our StreamCode-based active network. In section
3 we discuss the StreamCode prototype system. In section 4 we describe
contents-sensitive multi-QoS multicast, an application that demonstrates the
power of per-packet QoS customization. In section 5 we conclude the paper
and discuss remaining issues.

2 Architectural Description 

We assume two things for in-packet programs.

- Each packet contains the whole program (not a program ID) for the packet.
  This ensures that the program is supplied to the StreamCode processor without
  any sophisticated fetching mechanism.
- In-packet programs are written in the binary format of the processor instruction
  set. This enables processors to execute the incoming programs immediately.

Figure 1 is a sample in-packet program, hop-by-hop routing, written in the
mnemonic form of StreamCode instructions. The program is assembled into binary
form and is included in every packet. When a packet with the program arrives
at a node, a StreamCode processor in the node starts execution, assuming that
the first byte of the packet is the beginning of the program. The packet looks up
a routing table named 'DEFAULT_IPv4', allocates memory as an output buffer,
copies itself to the buffer and transfers itself to the next node.

Packet Format. Figure 2 shows the packet format. Except for the layer 2
frame (the IP header may be treated as such) and ANEP, there is no header, and no
clear distinction between the program and user data (the so-called payload). If a
programmer wants to place data in the packet, it must be protected by ensuring
that the area is never executed. For example, in the program of Figure 1, the
destination address field is reserved by skipping over the field at the beginning
(SKIP -> real_start). The payload area is protected by terminating program
execution before execution reaches it.

Target Layer. The StreamCode instruction set is designed to describe network layer
functions, that is, to deliver a packet on an end-to-end basis. Programs are free to
use information from other layers, for example upper layer information on intermediate
nodes (e.g., a particular type of user exists in this direction) or lower layer information
(e.g., this link is congested) that is provided at nodes. However, the instruction
set itself is designed for network layer functions.

Target Plane. In-packet programs written in StreamCode are designed basically
for the data plane. Since the current difficulty in extending IP functions lies
mainly in the data plane, the flexibility of capsule technology should focus there. It
may also be used for control/management, but the primary target is the data plane.



2.1 Design Concept for High Performance 

To be practical for data plane packet processing, in-packet programs must be
processed at a throughput comparable to that of today’s IP, and in a secure manner.
In the rest of this section we describe the design concept for achieving high
performance. The concept for security is described after the various components of the
network are introduced, in section 2.4.

The basic idea that capsules may achieve enough performance to be the foundation
of future networks comes from the observation that today’s
logic circuits (e.g., CPUs or network processors) are fast enough to achieve
multi-gigabit processing speeds. The bottleneck of the processing is, and will be,
access to memory. We therefore made the design decision to attach the
whole program, written in binary-form StreamCode instructions, to each packet,
which minimizes memory access during in-packet program execution.

As a result of the decision, all programs and payloads (the data to be handled
by the programs) are supplied to the packet-processing unit from the incoming port
and go out to the outgoing port. Since the programs are written in a
hardware-decodable instruction set, they can be executed immediately. In an ideal situation
where every instruction is executed in one clock cycle and there is no loop or backward
jump in the program, the packet goes through the execution unit in a pipeline.
The pipeline then works at the speed of the network, which is exactly the required speed.

// SC sample code for hop-by-hop routing
start:
    SKIP -> real_start;
dest_address:               // dest addr.
    DATA { 192 168 10 05 };
contents_len:               // payload's len. w/o this SC
    DATA { 05 00 };
real_start:
    // IPv4 table lookup; r(2): output IF id
    FUNC TABLE,
        table_id = DEFAULT_IPv4,
        spc_id = sr(SR_INPUT_SPACE_ID),
        addr = dest_address, len = 4 => r(2);
    SKIP LE, r(2), 0 -> err;
do_copy:
    // copies whole packet: input -> output
    // calculate packet length
    MOV4 i(contents_len) => r(3);
    ADD r(3), contents => r(4);
    FUNC ALLOC,             // allocates output buffer
        use = OUTPUT, dst_if = r(2),
        size = r(4) => r(5);
    // actual payload length is here.
    SKIP LE, r(5), 0 -> err;
    FUNC COPY,
        source_space = sr(SR_INPUT_SPACE_ID),
        source_addr = 0, length = r(4),
        dest_space = r(5),
        dest_address = 0 => r(6);
do_output:                  // declaration of packet transfer
    // waits until the func copy ends
    WAIT r(6);
    FUNC OUT_SINGLE,
        src_space = r(5), src_addr = 0,
        src_len = r(4), dst_if = r(2) => r(7);
    FIN r(7);
err:                        // no error handling now.
    FIN;
contents:
    // here comes actual user data.

Fig. 1. Example StreamCode Program: Hop-by-hop Routing.

Fig. 2. Packet Format.

In reality it is impossible to prevent pipeline stalls completely. Routing table
lookup causes memory access, and some instructions, such as the calculation
of a checksum, are almost impossible to complete in one clock cycle. However, thanks to
CAM technology it has become possible to look up a routing table in a fixed number
of clock cycles, and by using hardware-based parallel processing and sophisticated compiler
technology we believe we can limit the effect of stalls.

This approach has two drawbacks. The first is that the lower layer
MTU restricts the size of an in-packet program, since the program must fit in a packet.
This makes it impossible to describe complicated algorithms as in-packet programs.
The second is that the same program may be transferred repeatedly
in the network, thus wasting link bandwidth. The sample program in Figure
1 consumes 106 bytes after assembly. If we used cache technology like ANTS
or ANN, we could relax these problems. However, we believe that if end-user
applications can freely attach code to packets, the number of programs will
increase and may reach millions, which makes caching technology ineffective in
core routers. The Ethernet frame size can easily be expanded to 12K octets, and
link bandwidth is rapidly increasing thanks to xDSL, WDM and other technologies,
which will relax these problems. Complicated algorithms can be realized
by placing them as node-resident programs in EEs (this will
be discussed in section 2.2). We therefore decided to accept these drawbacks.

2.2 Functional Architecture 

Figure 3 shows the architecture of the node designed to handle StreamCode
programs in packets. The concept discussed above gives the basic structure
for data-path in-packet program processing. Besides this processing environment,
an environment for the control and management planes is necessary in nodes. We
therefore decided to put two environments for program execution in a node. One is the
StreamCode Processing Environment (SC-PE), where in-packet programs are
executed. This is for data path functions. The other is the Execution Environment
(EE), where node-resident programs such as routing daemons are executed. This
environment is for control and management functions.

Fig. 3. Functional Architecture of a StreamCode Node.

The architecture assumes that a high-end switch is, and will be, composed of
a main controller that manages the whole node, intelligent IF cards and a fast
but dumb switching fabric. A physical image of such a node is shown in Figure
4. Since SC-PE is an abstraction of IF cards and a switch, there may be many
SC-PEs in a node. EE is an abstraction of the main controller, and there is only
one EE in a node.

SC-PE. SC-PE is the characteristic environment of this architecture, in which
incoming in-packet StreamCode programs are executed. Figure 5 shows the
internal architecture of SC-PE.

When a packet arrives at a node, it goes into the input buffer through the L1/L2
input interface. Then the execution unit starts the program execution, using registers
and functional units and, if necessary, interacting with various memories through the
Memory Management Unit (MMU). The contents of the input buffer may be
copied to the output buffer byte by byte, or in a block using a burst-copy function
unit. When the output buffer is filled correctly, the FUNC OUT_SINGLE or
OUT_MULTI instruction (explained in the next subsection) in the program requests
that the block be moved to the input queue of the switching module; the block then
goes through the switch and is finally sent to the next node through the L1/L2
output interface.

There are two types of memories in SC-PE. One is a temporary memory.
It is initialized on packet arrival and on quit, and can be used solely for
the packet. The output buffer is allocated in this type of memory. The other is a
permanent memory. Information commonly used among packets, e.g., routing
table information or user request information for contents-sensitive multicast
(see section 4), is stored in this area.

Fig. 4. Physical Image of a StreamCode Node.

Fig. 5. Internal Architecture of SC-PE. (Legible figure labels: NodeOS, EE and
middleware; SC-PE; StreamCode processor.)

There is a security timer that watches the input buffer, the execution unit and the
temporary memory. Section 2.4 discusses this.

EE and nodeOS. It is impossible to execute long-lived or complicated
programs, e.g., routing daemons, in SC-PE, because the execution time for an
in-packet program is limited for security reasons (see section 2.4) and the size of
the program is limited by the MTU.

Most programs for control and management have such characteristics, and the EE
on the nodeOS provides the environment for such programs. The EE and nodeOS
are equivalent to the ones defined by DARPA’s active network working group.
Using EE programs as the control channel, we can make data-path algorithms more
sophisticated. For example, if we need a filtering algorithm that considers the congestion
level in the network, it can be achieved by running a daemon on the EE that collects
congestion information and exposes the result to in-packet programs in a simple
table format.

Interaction between SC-PE and EE. These two environments interact
through protected memory access from the SC-PE to the nodeOS. A portion
of the main memory of a node is shared memory, and can be looked up from
SC-PEs as the permanent, read-only memory. It might be convenient to allow write
access, but since shared memory is not very fast and its access delay fluctuates,
which is not desirable for SC-PE, we decided against it.

There is no particular API for sending data from an in-packet StreamCode program
to the nodeOS or to programs on the EE. If such an operation becomes necessary, a
StreamCode program can do it by building a small packet and sending it to the nodeOS
or to daemons on the nodeOS as an ordinary packet transfer.

2.3 StreamCode Instruction Set 

The program for SC-PE is written in StreamCode, a hardware-decodable
instruction set. It is composed of the basic instructions shown in Table 1 and the
network-specific library functions shown in Table 2.

Table 1. Basic Instructions.

SC Command                 Operation
MOV1, MOV2, MOV4           Load and store primitives
ADD, SUB, MUL, DIV, MOD    Arithmetic/logical calculation
SKIP                       Conditional/unconditional jump
NOP                        No operation
FIN                        Finish program execution
FUNC                       Call a library function



The basic part is a set of RISC-like instructions, whose flexibility in describing
various programs and scalability in cost, power consumption and performance
have already been proved by various implementations. It provides
64-bit memory addressing as well as access to 256 registers. This is
comparable to the instruction sets of current high-performance microprocessors,
e.g., the Intel IA-64 processor, and we hope that it is enough for future
applications. Table 3 shows the available memory address sizes and register counts
for various instruction sets. Such a large number of registers will also make it
possible to use them as temporary buffers and help reduce memory
access.

Table 2. Library Functions.

Library function   Operation
ALLOC              Allocate a buffer in temporary memory
ALLOC_PERM         Allocate a buffer in permanent memory
FREE               Free a buffer in permanent memory
COPY               Copy data between buffers
TABLE              Look up a table
OUT_SINGLE         Send a packet to a specified interface
OUT_MULTI          Send packets to specified interfaces



Table 3. Available Memory Address and Registers.

              Memory address   Registers
StreamCode    64 bit           256
IA-64         64 bit           256
MIPS64        64 bit           64
MIPS32        32 bit           32



Besides the basic instructions, StreamCode includes network-specific library
functions for complicated, time-consuming network-specific operations. These
functions, e.g., routing table lookup, buffer allocation, and burst copy from the input buffer
to the output buffer, will be implemented as a co-processor and will be executed
in parallel. The result is placed into a register, and the instruction decoder in the
processor can check whether a FUNC instruction has completed and its result
register is available. Every time a new instruction is
fetched, the decoder checks the related registers, and if any of them is not ready,
instruction decoding is blocked and the pipeline stalls. To prevent
such stalls, programmers should place several basic instructions between a FUNC
instruction and the first instruction that needs its result.
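
The effect of instruction ordering on such stalls can be illustrated with a small
Python model of the register-ready check. This is only a sketch under assumed
parameters: the function count_stalls, the 4-cycle FUNC latency, the 1-cycle
latency of basic instructions and the register names are illustrative assumptions,
not figures from the StreamCode design.

```python
def count_stalls(program, func_latency=4):
    """Count stall cycles in a simple in-order, one-issue-per-cycle model.

    program is a list of (dest_reg, src_regs, is_func) tuples.
    Assumptions of this sketch: a FUNC result becomes ready func_latency
    cycles after issue, a basic instruction result after 1 cycle, and an
    instruction cannot issue until all of its source registers are ready.
    """
    ready_at = {}   # register name -> first cycle at which it can be read
    cycle = 0
    stalls = 0
    for dest, srcs, is_func in program:
        earliest = max([ready_at.get(reg, 0) for reg in srcs], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # decoder blocks until regs are ready
            cycle = earliest
        if dest is not None:
            ready_at[dest] = cycle + (func_latency if is_func else 1)
        cycle += 1
    return stalls


# FUNC result r2 is used by the very next instruction.
naive = [
    ("r2", [], True),        # FUNC TABLE ... => r2
    (None, ["r2"], False),   # SKIP LE, r2, 0 -> err
    ("r3", [], False),       # MOV4 ... => r3
    ("r4", ["r3"], False),   # ADD r3, ... => r4
]

# Same work, but independent instructions fill the gap after the FUNC.
interleaved = [
    ("r2", [], True),
    ("r3", [], False),
    ("r4", ["r3"], False),
    (None, ["r2"], False),
]

if __name__ == "__main__":
    print(count_stalls(naive), count_stalls(interleaved))
```

Under these assumptions the interleaved ordering stalls for fewer cycles than the
naive one, because the basic instructions run while the co-processor is busy.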

The reason for developing a new instruction set is to make the instructions
variable in length and shrink the size of in-packet programs, while still keeping
hardware decoding easy. The StreamCode instruction set can describe the
algorithm for contents-sensitive multicast (see section 4) in 384 bytes (Table
4), while the MIPS32 instruction set needs 4110 instructions (16440 bytes). The main
reason for the huge difference is that StreamCode has a burst, arbitrary-length DMA
copy command that works efficiently in copying payload from the input buffer to
the output buffer. If we assume that MIPS32 also has this command, it consumes
544 bytes, still 41% larger than StreamCode. Any instruction except FUNC
can be converted to MIPS instructions in one clock cycle, which shows how easy
the instructions are to decode in hardware.

Table 4. Program Size of StreamCode, MIPS32, and MIPS32-DMA.

Inst. set     Number of inst.   Program size   Average inst. length
StreamCode    119               384 bytes      25.8 bit
MIPS32        4110              16440 bytes    32 bit
MIPS32-DMA    136               544 bytes      32 bit
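
The derived figures in Table 4 follow directly from the instruction counts and
program sizes. The short Python sketch below recomputes them; the helper names
are ours and only the numbers are taken from the table.

```python
# (number of instructions, program size in bytes), copied from Table 4.
PROGRAMS = {
    "StreamCode": (119, 384),
    "MIPS32": (4110, 16440),
    "MIPS32-DMA": (136, 544),
}


def average_bits_per_instruction(instructions, size_bytes):
    """Average instruction length in bits."""
    return size_bytes * 8 / instructions


def percent_larger(size_bytes, baseline_bytes):
    """How much larger size_bytes is than baseline_bytes, in percent."""
    return (size_bytes - baseline_bytes) / baseline_bytes * 100


if __name__ == "__main__":
    for name, (count, size) in PROGRAMS.items():
        bits = average_bits_per_instruction(count, size)
        print(f"{name}: {bits:.1f} bit per instruction")
    baseline = PROGRAMS["StreamCode"][1]
    extra = percent_larger(PROGRAMS["MIPS32-DMA"][1], baseline)
    print(f"MIPS32-DMA is {extra:.1f}% larger than StreamCode")
```

The fixed 32-bit MIPS32 encoding gives 4 bytes per instruction, so 136
instructions occupy 544 bytes, which is 160 bytes more than the 384-byte
StreamCode program.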



2.4 Security 

Since we use a hardware-decodable instruction set, we cannot rely on sophisticated
language design for security, as ANTS does with its Java virtual machine.
We propose a simple security policy:

Limit the execution time for in-packet programs.

We isolate all the resources for each packet in SC-PE, except the permanent memory
area. When the execution time limit expires, all the resources used by the packet
are cleared. This prevents malfunctioning or malicious packets from affecting others.
If an in-packet program malfunctions and the packet does not behave as the
sender intended, that is the sender’s problem.

The execution time allowed for a packet is the same as the time necessary to
receive it; it is proportional to the packet length. When a packet arrives at the input
buffer of a node, the security timer in SC-PE starts. It measures the time taken to
receive the packet and records it. The execution of the in-packet program starts when a
certain number of bits have been stored in the input buffer. When the time taken to
receive the packet has elapsed since the start of the execution, the security timer clears
the registers and forcibly stops the program execution completely, except for the
burst copy from the input buffer to the output buffer. Programmers must therefore
ensure that all execution except the burst copy completes before the time limit
expires. If a packet changes a state on a node incorrectly, it affects all succeeding
packets, and the inconsistency must be resolved by the end users’ application.
This may sound too radical, but we believe that this is the right direction for software
evolution, because today’s processors are becoming more and more dependent on
sophisticated compilers.

This security policy has two drawbacks. The first is the restriction it places
on StreamCode processor implementations. To realize this policy, we must
standardize the maximum number of clocks for the execution of each instruction,
including load and FUNC instructions. Without this standard a StreamCode
programmer cannot know whether or not a program can finish within the time
limit. This restricts the implementation, especially that of FUNC
instructions. Prefetching of related memory may be necessary. These are the
costs of gaining the freedom to inject code into networks. The second is the
fact that most packets must contain some payload. If an in-packet program
contains an instruction that does not complete in one clock, a payload must be
attached to meet the security policy whether or not the payload is necessary.
This wastes bandwidth, because some packets, for example some kinds of ICMP
packets, do not require payloads. However, as we stated in the section on the
performance-related design concept, we believe that flexibility will become
more important than wasted bandwidth in future networks, and we decided to
accept these drawbacks.
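
The time budget and the payload requirement described in this section can be
illustrated with a small calculation. The following is only a sketch, written
in Java for readability (StreamCode itself is a hardware instruction set). The
link rate, the processor clock, and the per-instruction maximum clock counts
are hypothetical inputs, because the standardized maximums are not given here,
and the class and method names are ours.

```java
// Illustrative sketch only: all names and numeric parameters are assumptions.
class ExecutionBudget {
    /** Clocks available to a packet: the time needed to receive it, in processor clocks. */
    static long budgetClocks(int packetBytes, long linkBitsPerSec, long clockHz) {
        // receive time = bits / link rate; budget = receive time * clock frequency
        return (packetBytes * 8L * clockHz) / linkBitsPerSec;
    }

    /** Worst-case clocks of a program, given an assumed maximum clock count per instruction. */
    static long worstCaseClocks(int[] maxClocksPerInstruction) {
        long total = 0;
        for (int clocks : maxClocksPerInstruction) {
            total += clocks;
        }
        return total;
    }

    /** Smallest packet length (program plus payload) whose receive time covers the worst case. */
    static int minPacketBytes(long worstClocks, long linkBitsPerSec, long clockHz) {
        // Require packetBytes * 8 * clockHz / linkBitsPerSec >= worstClocks.
        long numerator = worstClocks * linkBitsPerSec;
        long denominator = 8L * clockHz;
        return (int) ((numerator + denominator - 1) / denominator); // ceiling division
    }
}
```

With an assumed 1 Gbit/s link and an assumed 100 MHz processor clock, a
1000-byte packet would have a budget of 800 clocks. A program whose worst case
exceeds the budget of its own packet would need extra payload, which is the
bandwidth cost discussed above.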

2.5 Session and State Information 

There is a permanent memory region in the SC-PE that is not cleared even when
the execution time limit for a packet arrives. As described earlier, this
region is used for state handling and for information exchange among data
packets. We will provide a few authentication mechanisms, ranging from none to
a cryptographic one, to restrict access to this region. End-user applications
should choose appropriate mechanisms considering their risks.

An access right must be set in the MMU to use this region. Since this is a
time-consuming task, a user must send a signaling packet beforehand to do it.
The packet allocates the required amount of memory and sets an ID for
accessing it. Succeeding data packets present the ID to the SC-PE and access
the allocated region.
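
The two-step use of the permanent region (a signaling packet allocates memory
and sets an ID, then data packets present the ID) can be modeled as below. This
is a minimal sketch under our own assumptions: the class and method names are
hypothetical, memory is modeled as an array of words, and no authentication
mechanism is modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: names and the word-array memory model are assumptions.
class PermanentMemory {
    private final Map<Integer, int[]> regions = new HashMap<>();
    private int nextId = 1;

    /** Signaling packet: allocate a region of the given size and return the ID for later access. */
    int allocate(int words) {
        int id = nextId++;
        regions.put(id, new int[words]);
        return id;
    }

    /** Data packet: read a word from the region, presenting the session ID. */
    int load(int id, int offset) {
        return region(id)[offset];
    }

    /** Data packet: write a word to the region, presenting the session ID. */
    void store(int id, int offset, int value) {
        region(id)[offset] = value;
    }

    // Access is refused when no region was allocated for the presented ID.
    private int[] region(int id) {
        int[] words = regions.get(id);
        if (words == null) {
            throw new IllegalStateException("no access right for ID " + id);
        }
        return words;
    }
}
```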



2.6 Middleware 

The time-consuming session management signaling will be done by node-resident
programs on the EE, though in the current implementation described below it is
done by normal in-packet StreamCode programs.

We believe that there should be various node-resident programs, or middleware,
on end-user terminals and on intermediate nodes. The functional difference
between terminals and intermediate nodes lies in the difference between these
programs. On end-user terminals, a typical function is advanced socket
emulation, which attaches appropriate StreamCode programs to appropriate
packets considering network congestion, failures, and user requirements. On
intermediate nodes, (general or proprietary) routing daemons, user
authentication for session management, and logging are common functions.

3 Proof-of-Concept Implementation 

StreamCode is designed to be implemented in hardware, but we made a
proof-of-concept prototype in software on FreeBSD 4.2 because it is easier to
implement.

The primary purpose of this prototype is to check whether it is possible to
write meaningful applications such as contents-sensitive multicast (see section
4) with a reasonable program length or number of instructions. The prototype
is therefore a minimum set to achieve this goal and is composed of the SC-PE
and its minimum interface for packet exchange, i.e., packet receive and send.
We call this the SC engine. Other components and interfaces such as the nodeOS,
the EE, middleware, and the interaction between them will be implemented in
the future. The program execution termination mechanism is also omitted.
Whether a packet's execution can finish within the time limit can be
calculated from the standardized maximum instruction execution time, and the
data for standardization should come from a hardware implementation. We
therefore did not implement the execution termination mechanism in the current
version.

Figure 6 shows the architecture overview of the prototype.

Fig. 6. Overall Architecture of Experimental System.

The development flow of StreamCode applications and the processing of
StreamCode packets are as follows.

1. A programmer writes StreamCode programs in mnemonics, considering the
user- and application-specific requirements. They are assembled and stored in
binary form.

2. A StreamCode attachment module on a server integrates a stored binary
program and a payload into a complete StreamCode packet.

3. The current implementation uses UDP/IP as layer two of StreamCode packets.
An ANEP implementation is ongoing. The StreamCode programs and user data
become the payloads of UDP/IP packets and are sent into the network. UDP/IP
is used just as the datalink layer. Routing must be done by the StreamCode
program.

4. When a node receives a packet, it executes the in-packet StreamCode program
and, if required, transfers the packet to the next node by changing the
destination address in the UDP/IP header.

5. There is a special register in a StreamCode processor that shows whether or
not the node is at the edge of a StreamCode network. If a StreamCode program
wants to deliver a normal UDP/IP packet to the destination, this can be done
by not copying the program part of the packet when the packet is in the node
next to the destination.

Fig. 7. Overall Architecture of Contents-Sensitive Multi-QoS Multicast.

If we use an Intel Pentium III 933 MHz processor and a Netgear GA620
1000Base-SX NIC as the hardware and execute a hop-by-hop packet forwarding
program, the SC-PE processes 6223 packets/sec with a packet size of 8746
bytes. This corresponds to a throughput of 829 Mbps.

4 Contents-Sensitive Multi-QoS Multicast

Using StreamCode we can customize each packet’s processing algorithm and thus 
customize each packet’s QoS by attaching a different program to each packet. 
Contents-sensitive multicast is an application that shows the potential of this 
flexibility. 

Figure 7 shows the architecture of the system. A streaming media server
multicasts movies and advertisements in series, like today's TV channels.
For each video, two or three streams of different quality are simulcast, e.g.,
a low-quality movie and a high-quality movie. Different StreamCode programs
are attached to movie packets and advertisement packets.

There are two tables on intermediate StreamCode nodes. One is an
application-specific table, that is, a table that describes the requirements
of the users connected to each output port. Some users require a high-quality
movie because they are willing to pay for it. Some users require a low-quality
movie because they receive it on PDAs. For a user who does not pay, the sponsor
of the advertisement requests the network to deliver the ad in high quality.
The other table is a general congestion table for that output port.

When a movie packet arrives at an intermediate StreamCode node, the program
checks the quality identifier in the payload and looks up the user
requirements table. If the packet's quality is the same as or lower than the
quality requested by the users of a port, the program checks the congestion
level, and if the level is low enough for the packet to be transferred, the
packet goes to the next node together with the program. If the node is at the
edge of an active network, the packet transfers itself to the next node
(probably the user terminal) if it contains the highest-quality video that has
managed to reach that node under the congestion. For the transfer over the
last hop, the current program is designed not to copy the program part of
itself; thus the packets that the user terminals receive are traditional
UDP/IP packets and can be played back with ordinary streaming media players.
Advertisement packets behave in the same manner, though they look up different
fields of the requirements table.
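
The per-port decision at an intermediate node can be sketched as follows. This
is a minimal illustration, not the StreamCode program itself: the table layout,
the field names, and the rule that a packet carries the highest congestion
level at which it may still be sent are all assumptions. The special handling
at the edge node is not modeled.

```java
// Illustrative sketch only: table layout, names, and the congestion rule are assumptions.
class MulticastForwarding {
    /** One output port: an entry of the user requirements table and of the congestion table. */
    static final class Port {
        final int requestedQuality; // highest quality requested by users on this port
        final int congestionLevel;  // current congestion level of this port
        Port(int requestedQuality, int congestionLevel) {
            this.requestedQuality = requestedQuality;
            this.congestionLevel = congestionLevel;
        }
    }

    /**
     * Decide whether a packet is forwarded to a port.
     * maxCongestion is the highest congestion level at which this packet may still be sent
     * (assumed here to be a constant carried by the in-packet program).
     */
    static boolean forward(int packetQuality, int maxCongestion, Port port) {
        boolean wanted = packetQuality <= port.requestedQuality;
        boolean lowEnough = port.congestionLevel <= maxCongestion;
        return wanted && lowEnough;
    }
}
```

Because the movie and advertisement packets carry different programs, each
kind of packet could apply its own thresholds without any change to the node.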

We use simulcast of videos of different quality because it is easy to
implement. If we used a more sophisticated, hierarchical encoding scheme such
as WaveVideo, we could unify these streams.

We implemented this application on the prototype described in the previous
section. It worked well with Microsoft Media Player as the client software.

As shown in Table 4, it takes 384 bytes to describe this algorithm. This is
not a small value, but we believe that it is acceptable as the cost of
user-driven network evolution when the MTU is extended to 9K bytes or more. We
therefore conclude that in-packet programs in StreamCode can describe
practical programs within the MTU limitation.

5 Conclusion and Future Plan 

We proposed a capsule-type active network architecture and a processor
architecture, StreamCode, whose security is achieved by limiting in-packet
program execution on a time basis. Using a proof-of-concept implementation, we
also proposed a contents-aware multicast whose QoS is controlled on a
per-packet basis.

Many issues remain to be solved. One of the most important is implementing the
SC-PE in hardware, through which we can prove the applicability of the
security policy as well as competitiveness in performance with traditional IP
routers.

From the QoS control point of view, we plan to investigate the possibility of
per-packet QoS customization further. The current method uses only the
abstracted congestion level. It could contain various kinds of information,
such as failures or the congestion levels of particular geographical areas.
The packet handling method can also be improved. At present, if congestion
occurs, packets are discarded as a whole, but partial discard of a payload may
make sense. We would like to evaluate these possibilities.
Abstract. As the Internet has become an infrastructure for global
communication, network failures and quality degradation have become serious
problems. To solve them, a network monitoring system that monitors Internet
traffic in real time is strongly desired. Traffic monitors, which collect
statistics from captured packets, play a key role in such a system; however,
they are not flexible enough to be used in the rapidly changing Internet. The
traditional approach, in which a new traffic monitor is developed for each new
requirement, results in a long development turnaround time. Therefore, we have
proposed a flexible network monitoring system that consists of programmable
traffic monitors. The traffic monitors are made programmable by introducing
active network techniques; we therefore call the network monitoring system the
active monitor network. This paper describes the implementation and evaluation
of the active monitor network.



1 Introduction 

As the Internet has become an infrastructure for global communication, network
failures and quality degradation have become serious problems. To solve them,
a network monitoring system is desired that monitors the quality and traffic
of the Internet in real time. Traffic monitors are key elements of the network
monitoring system. They are real-time measurement tools that collect traffic
statistics from packets captured from a tapped link. Many traffic monitors are
used for various purposes. MRTG (Multi Router Traffic Grapher) [1], which
collects the number and bytes of IP packets from an MIB (Management
Information Base) of a router, is used to detect congested links of an IP
network. NetFlow [2] and NeTraMet [3], which collect the number and bytes of
IP packets and TCP packets for flows, are used to monitor the traffic volume
of an individual user. Recently, new traffic monitors that collect TCP-level
quality statistics have been used to monitor the quality provided to an
individual user.

These traffic monitors are useful for monitoring an IP network; however, they
are not flexible enough to be used in the rapidly changing Internet. The
traditional approach, in which a new traffic monitor is developed for each new
requirement, results in a long development turnaround time. Although new
network monitoring applications such as DOS (Denial of Service) attacker
detection and real-time traffic quality monitoring are emerging, the traffic
monitors for them cannot be developed in time to meet network operators'
requirements. Besides, traditional traffic monitors work stand-alone;
therefore, it is difficult to develop a network monitoring system that
monitors a whole IP network in real time using traffic monitors distributed
over the network.

To achieve a flexible network monitoring system, introducing the active
network approach [5] is very promising. We propose an active monitor network
which consists of programmable traffic monitors and a manager. We call a
programmable traffic monitor an active monitor. A manager can dynamically load
an analysis program which analyzes packets captured from a tapped physical
link. This programmability achieves a flexible network monitoring system.

There has also been much research [6,7,8,9,10] on introducing active networks
into network management, of which network monitoring is an important part. In
these active network management systems, such as SmartPacket, NetScript, and
ANCORS, the management node and the agent node are made programmable. This
programmability enables intelligent agent behavior, such as automatically
checking MIB information. However, these active network management systems
cannot make a network monitoring node, i.e., a traffic monitor, programmable.
Although a program which runs on an agent analyzes the MIB information, it
cannot analyze the captured packets themselves. Recently, an active monitoring
system called the ANMOS Monitoring Tool has been proposed. That system focuses
on monitoring an active network itself in order to understand how the active
network behaves. In contrast, our active monitor focuses on traditional
connectionless networks such as the Internet. Besides, our objective is to
make a traffic monitor programmable so that network operators can collect
traffic information which could not be collected without analyzing the
parameters and sequences of captured packets. We believe that our paper is the
first proposal in the literature to introduce active networks into traffic
monitors.

In this paper, we propose an active monitor network and discuss the imple- 
mentation and evaluation. In section 2, we describe the overview of the active 
monitor network. In section 3, we describe the implementation overview. In sec- 
tion 4, we describe the experiment results using the active monitor network. In 
section 5, we discuss the active monitor network approach. 



2 Overview of Active Monitor Network 

An active monitor network, as shown in Fig. 1, is a monitoring network which
monitors an IP network consisting of routers and physical links. It consists
of active monitors and a manager. An active monitor is a programmable traffic
monitor, and consists of a platform and an analysis program which is remotely
loaded from a manager. An analysis program captures packets from a tapped
physical link and analyzes the captured packets. The analysis results are
stored in a result storage. A manager is also programmable. A monitoring
application program manages active monitors and runs on a manager platform.
The application program remotely gets the results of analysis programs, and
thereby learns the quality and traffic of the whole monitored IP network.



Fig. 1. Active Monitor Network. (Components shown: a Manager running an
Application Program, which loads and unloads Analysis Programs on Active
Monitors.)



2.1 Programmability of Active Monitor 

The programmability of an active monitor is achieved in the following way. An
analysis program runs on a platform as shown in Fig. 2. It can be loaded from
a manager to an active monitor via a network at any time, without stopping the
monitor. An analysis program corresponds to a programmable switch [12] in an
active network. A standard programming language such as Java or C is used for
writing an analysis program.

The platform provides an execution environment to analysis programs. It 
provides the following functions: 

— Execution (Interpretation) of analysis program 

— Load of analysis program 

— Unload of analysis program 

— Message communication to a manager 

— Packet capture / filter 

2.2 Programmability of Manager 

A manager is also made programmable. A manager executes a network monitoring
application program (the application program in Fig. 1 and Fig. 2). An
application program controls active monitors located at many places in an IP
network. The structure of a manager is similar to that of an active monitor.
It consists of an application program and a platform. The platform provides
the load/unload, execution, and message communication functions, as the active
monitor platform does. The platform also provides the topology management
function, which is described in section 2.4.

Fig. 2. Manager and Active Monitor. (Components shown: a Manager with its
Application Program, and an Active Monitor whose Analysis Program analyzes
packets from a tapped physical link.)



2.3 Communication between Manager and Active Monitor 

The relationship of a manager and an active monitor is the same as that of 
a manager and an agent of traditional network management methods such as 
OSI management and SNMP based Internet management. An analysis program 
executed on an active monitor analyzes captured packets and stores an analysis 
result in a result storage like an MIB of SNMP. A manager gets an analysis result 
from an active monitor’s result storage using client-server style communication. 
Three client-server style message exchanges are defined for the three operations: 
load and unload of analysis program and get result. 



2.4 Network Topology 

The manager platform provides an application program with topology information
on a monitored IP network and an active monitor network. The topology
information is a list of pairs of link identifiers from the following two
viewpoints. First, a link is identified by the source and destination router
IP addresses in the monitored IP network. Second, the same link is assigned a
unique identifier within the active monitor network, which consists of an
active monitor IP address and an identifier within that active monitor. The
topology information is used for many purposes by an application program.

Fig. 3. Object Structure of Active Monitor.



3 Implementation of Active Monitor Network 

An active monitor and a manager have been implemented in Java as software
which runs on standard PCs (Personal Computers) and workstations. Java is
adopted as the programming language for writing analysis programs because of
its high productivity and portability. Currently, the developed software runs
on the Solaris 2.7/2.8 and Solaris 2.7/2.8 for x86 operating systems.

3.1 Active Monitor 

(1) Program Structure 

As shown in Fig. 3, an active monitor consists of Java objects (instances of
Java classes) which run on the JVM (Java Virtual Machine). An Analysis object
corresponds to an analysis program of section 2. The load and unload of an
Analysis object are achieved by creating and destroying the class object of
the Analysis class. The other objects constitute the platform. Individual
functions of the platform are implemented as Java classes and are executed as
Java objects. A System object manages the platform and controls the other
objects. A MessageComm object provides message communication using two
standard Java objects: ServerSocket and Socket objects. A ServerSocket object
is used to listen for a new TCP connection, and a Socket object is used to
send and receive messages using TCP/IP. A Jpcap object is a public domain Java
object that provides packet capture and packet filtering. A Jpcap object uses
the libpcap and BPF (Berkeley Packet Filter) programs, which are outside the
JVM, for packet capture and filtering, respectively. Libpcap is a public
domain program which captures IP packets from a communication board. BPF is a
public domain packet filter program. When a Jpcap object receives a packet
from the libpcap/BPF program, it creates a Packet object.

(2) Analysis Class 

When users need a new monitoring application, they write a class as a subclass
of the SuperAnalysis class, which provides a framework for packet analysis.
They need to override three methods: init, handlePacket, and getResult. The
init method is used to initialize an Analysis object. The handlePacket method
is the main method. It is invoked by a Jpcap object when a packet is captured
from a tapped link. The packet analysis is written in this method. The
getResult method is used to send back an analysis result using a MessageComm
object.

An example of an Analysis class, ExAnalysis, is shown in Fig. 4. This class
counts the UDP packets transferred on a tapped physical link whose source and
destination IP addresses are specified by a packet filter condition (aFilter
in Fig. 4).

When an ExAnalysis object is created, the init method is invoked by a System
object with the linkID, srcIPaddr, and aFilter arguments. The arguments are
sent from an application object of a manager. The aFilter argument is a BPF
filter condition which specifies UDP packets with the specified source and
destination IP addresses. Then, the run method is invoked, and it starts a
Jpcap object, assigning it a thread. When the Jpcap object finds a packet
which matches the filter condition specified by aFilter, it invokes the
handlePacket method. The handlePacket method further checks whether the
destination port number of the UDP packet is the same as the value of the
targetPort instance variable. This procedure is repeated until the quit method
is invoked.

When an application object of the manager sends a get result request, a
MessageComm object calls the getResult method. The getResult method checks
whether the number of detected UDP packets is larger than the value of the
threshold instance variable. Then, it returns the result as a string to the
MessageComm object, and the MessageComm object sends the result back to the
manager.

(3) API (Application Programming Interface) 

There is no constraint on writing Analysis classes. Users can use any methods
provided by the Java standard classes. The classes of the Jpcap object provide
analysis methods at the MAC (Media Access Control), IP, UDP, and TCP levels.
The classes provide methods which get and set the protocol parameters of these
protocols. In contrast, the analysis of protocols above TCP, such as SMTP and
WWW, is written in the handlePacket method in Java by the users themselves.

public class ExAnalysis extends SuperAnalysis {
    private int packNum = 0;      // number of detected UDP packets
    private int threshold = 100;  // threshold compared in getResult
    private int targetPort = 0;   // destination port to match

    // Invoked by a System object when the Analysis object is created.
    public void init(int linkID, String srcIPaddr, String aFilter, int dstPort, ...) {
        targetPort = dstPort;
        super.init(linkID, srcIPaddr, aFilter, ...);
    }

    // Invoked by the Jpcap object for each packet that matches aFilter.
    public void handlePacket(UDPPacket packet) {
        if (packet.dst_port == targetPort) {
            packNum++;  // count up
        }
    }

    // Invoked by a MessageComm object on a get result request.
    public String getResult() {
        String result;  // result storage
        if (packNum > threshold) { result = "TRUE"; }
        else { result = "FALSE"; }
        return result;
    }

    // Starts packet capture; handlePacket is called back for each packet.
    public void run() {
        aJpcapObj.loopPacket(-1, this);
    }

    public void quit() {
    }
}



Fig. 4. Example Analysis Class. 



3.2 Manager 

(1) Program Structure 

The program structure of the manager is shown in Fig. 5. The abstract class
SuperApplication is provided for writing a network monitoring application. The
platform consists of Java classes such as the System, MessageComm,
ServerSocket, and Socket classes, which are similar to those of the active
monitor. Besides, classes for topology management are provided. The class
files of Analysis classes are stored as files.

(2) Application Class 

A network monitoring application is written as a subclass of the
SuperApplication class, and its object corresponds to the Application object
in Fig. 5. A typical procedure of an Application object is as follows. First,
the Application object loads the class file of an Analysis class to active
monitors. The Analysis class is written beforehand and is compiled into the
class file by users with a Java compiler. Second, the Application object gets
analysis results from the loaded Analysis objects and analyzes the results
from a network-wide viewpoint. Finally, the Application object unloads the
loaded Analysis objects at the active monitors.

Fig. 5. Object Structure of Manager.



(3) Topology Management 

A Topology object is provided so that users can write a method that takes into
account the locations of active monitors and the topology of a monitored IP
network. The Topology object is a table of Link objects, each of which
represents a link. A Link object contains the source and destination router IP
addresses, a bandwidth, and an active monitor identification which consists of
an active monitor IP address and a link identifier. The link identifier is
assigned uniquely within the active monitor. Figure 6 and Table 1 show a
monitored IP network and the table maintained by a Topology object,
respectively. The monitored network consists of three routers whose IP
addresses are 133.128.10.1, 133.128.12.1 and 133.129.11.1. Three active
monitors are located so as to tap all the links among the routers. Each line
of Table 1 corresponds to a Link object.

Many basic methods are provided to get the following basic pieces of
information from a Link object: an active monitor identifier, source and
destination router IP addresses, the bandwidth of a link, the total number of
Link objects, the Link object next to the currently accessed Link object, and
so on. In addition to the basic methods, many methods are provided to analyze
the network topology of a monitored IP network. For example, the SPFmake
method creates a shortest path first tree whose root is specified. The
bandwidths of a Topology object correspond to the metric of the OSPF (Open
Shortest Path First) routing protocol. If a monitored IP network uses OSPF as
its routing protocol, the created shortest path tree can be used to determine
the route on which a captured IP packet is transferred. The MaxHop method
calculates the maximum hop (link) number from a specified source router to a
specified destination router.
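
How a shortest path first tree and hop counts can be derived from a table of
links is sketched below. This is not the Topology class of the implementation:
the class and method names (TopologySketch, spfTree, maxDepth) are
hypothetical, links are treated as bidirectional, and each link carries an
explicit cost because the mapping from bandwidth to the OSPF metric is not
given here.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative sketch only: names and the explicit cost field are assumptions.
class TopologySketch {
    static final class Link {
        final String src;
        final String dst;
        final int cost;
        Link(String src, String dst, int cost) {
            this.src = src;
            this.dst = dst;
            this.cost = cost;
        }
    }

    /** Dijkstra's algorithm: maps every reachable node to its parent on the shortest path to root. */
    static Map<String, String> spfTree(List<Link> links, String root) {
        Map<String, Integer> dist = new HashMap<>();
        Map<String, String> parent = new HashMap<>();
        PriorityQueue<Map.Entry<String, Integer>> queue =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        dist.put(root, 0);
        queue.add(new AbstractMap.SimpleEntry<>(root, 0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> entry = queue.poll();
            String u = entry.getKey();
            if (entry.getValue() > dist.get(u)) {
                continue; // stale queue entry, a shorter path was already found
            }
            for (Link link : links) {
                String v;
                if (link.src.equals(u)) {
                    v = link.dst;
                } else if (link.dst.equals(u)) {
                    v = link.src;
                } else {
                    continue;
                }
                int d = entry.getValue() + link.cost;
                if (d < dist.getOrDefault(v, Integer.MAX_VALUE)) {
                    dist.put(v, d);
                    parent.put(v, u);
                    queue.add(new AbstractMap.SimpleEntry<>(v, d));
                }
            }
        }
        return parent;
    }

    /** Largest number of hops (links) from any node of the tree to its root. */
    static int maxDepth(Map<String, String> parent) {
        int max = 0;
        for (String node : parent.keySet()) {
            int hops = 0;
            for (String n = node; parent.containsKey(n); n = parent.get(n)) {
                hops++;
            }
            max = Math.max(max, hops);
        }
        return max;
    }
}
```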

Fig. 6. Example Monitored IP Network.

Table 1. Topology Object.

3.3 Message Communication

The three operations (load, unload, and get result) are achieved by an
exchange of a request and a response using TCP/IP. The MessageComm objects of
a manager and an active monitor provide the communication. Figure 7 shows a
typical example of the message sequence between a manager and an active
monitor. First, a manager sends the class file of an Analysis object to an
active monitor (load request at (i) of Fig. 7). After receiving the request,
the active monitor starts executing the object and sends back a response to
the manager. After some time, the manager sends a request to get an analysis
result from the result storage of the active monitor (get request at (ii) of
Fig. 7). Get requests can be sent at any time and any number of times, like a
get request of SNMP. Finally, the manager sends an unload request to the
active monitor (unload request at (iii) of Fig. 7). When receiving the
request, the active monitor stops the execution of the Analysis object and
unloads it.
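
The request handling on the active monitor side can be modeled as a small
dispatcher. This is only an in-memory sketch under our own assumptions: the
message format, the response strings, and all names are hypothetical, the
loaded analysis is represented by a function returning its current result, and
the TCP/IP transport and class file loading of the real system are omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch only: message format, response strings, and names are assumptions.
class MonitorSketch {
    enum Op { LOAD, GET_RESULT, UNLOAD }

    // Loaded analyses, keyed by a name chosen by the manager.
    private final Map<String, Supplier<String>> loaded = new HashMap<>();

    /** Handle one request and return the response text. The analysis argument is used by LOAD only. */
    String handle(Op op, String name, Supplier<String> analysis) {
        switch (op) {
            case LOAD:
                loaded.put(name, analysis);
                return "OK";
            case GET_RESULT: {
                Supplier<String> current = loaded.get(name);
                return current == null ? "ERROR: not loaded" : current.get();
            }
            case UNLOAD:
                return loaded.remove(name) != null ? "OK" : "ERROR: not loaded";
            default:
                throw new IllegalArgumentException("unknown operation");
        }
    }
}
```

A manager following the sequence above would issue one LOAD, any number of
GET_RESULT requests, and finally one UNLOAD for the same name.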

Fig. 7. Example Communication Sequence.



4 Network Monitoring Experiments 

In order to evaluate the developed active monitor network, we have performed
two experiments. First, we have developed a DOS attacker detection application
by writing Analysis and Application classes, and have run experiments over a
test bed network. Second, we have made a preliminary measurement of the
performance of the active monitor.



4.1 DOS Attacker Detection Application Development 

(1) DOS Attacker Detection Problem 

Recently, DOS attacks such as ICMP Flood and TCP Flood have become a serious
problem which degrades Internet security. Usually, the source IP addresses of
DOS packets are forged (spoofed); therefore, a DOS attacker who sends DOS
packets cannot be identified using the source IP addresses of the packets. Due
to this, detecting DOS attackers is becoming an important problem [13] for a
network monitoring system.

We have developed a DOS attacker detection application. The DOSApplication and
DOSfilter classes are written. They correspond to the Application and Analysis
classes, respectively. A DOSfilter object is used to detect DOS packets. For
DOS packets, the destination IP address is that of the target host of the DOS
attack. The source IP addresses are forged; however, the addresses usually
fall within some network number specified by an IP address and an address
prefix. Therefore, a DOSfilter object detects a DOS packet by filtering
captured packets with a filter condition consisting of the above destination
IP address and the source IP network number. The DOSfilter class is similar to
the ExAnalysis class of section 3.1, and returns TRUE or FALSE when a get
result is requested.

The DOSApplication class detects the link from which an attacker sends DOS
packets. The algorithm of the class is explained using the experiment network
configuration shown in Fig. 8. The network consists of 5 routers and 7 LANs
(Local Area Networks). The IP addresses of the routers and the network numbers
of the LANs are shown in Fig. 8. For example, IP1 is the IP address of a
router, and NW1 is the network number of a LAN. The detection algorithm is
designed on the following assumptions. First, all links are tapped by active
monitors. Second, the IP network uses OSPF as its routing protocol, and it
comprises a single OSPF area. Besides, a Topology object maintains the same
topology information as that of OSPF. Third, all links between two routers are
symmetric, and the bandwidths of both directions are the same.




Fig. 8. Experiment Network Configuration. 



(2) Simple Algorithm 

Since all links are monitored, the simplest algorithm is to load DOSfilter
objects to all active monitors. By getting the results of whether a DOS packet
is detected, a manager learns the links on which DOS packets are transferred,
and finally determines the link nearest to the attacker by combining the
links. This algorithm is straightforward but requires many message exchanges
between a manager and active monitors. Since the network of Fig. 8 consists of
28 links, 28 message exchanges happen.

(3) Advanced Algorithm to Reduce Message Number 

We have designed an advanced algorithm which uses the topology information in
order to reduce the number of message exchanges. Since the network uses OSPF
and the bandwidths of both directions of each link are the same, an IP packet
takes the same route from a source to a destination as from the destination to
the source. Therefore, all packets sent to a target host in LAN NW1 are
transferred on routes of the SPF (Shortest Path First) tree whose root is the
LAN NW1, as shown in Fig. 9. Using this tree, DOS attackers are detected using
fewer messages than with the simple algorithm. The DOSApplication detects DOS
attackers in the following way:




Fig. 9. Detection Using SPF Tree. 



(i) The DOSApplication object calculates the SPF tree by invoking a method of 
a Topology object. Then, it calculates a maximum hop number among all routes 
from LANs to LAN NW1. In Fig. 9, the maximum hop number is 4. 

(ii) It calculates half of the maximum hop number so that DOSFilter objects 
are loaded to the links next to the middle routers of the SPF tree. In Fig. 9, 
these are the links of the 3rd hop. 

(iii) It loads DOSFilter objects to the active monitors which tap the links of 
the calculated hop number. It then gets the analysis results. When the 
DOSApplication object gets links on which DOS packets are detected, it loads 
DOSFilter objects to the active monitors of the next hop links of the detected 
links. In Fig. 9, the link between IP3 and NW3 and that between IP3 and NW4 are 
these links. This procedure is repeated until all routes of the DOS packets are 
determined. 

In Fig. 9, when the 4th hop links are checked, the algorithm finishes and 
detects a DOS attacker at the link between IP3 and NW3. In this example, 
only 6 message exchanges are needed, whereas 28 message exchanges happen when 
using the simple algorithm. 
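
Steps (i) to (iii) can be sketched as a search over the SPF tree. The sketch below is an illustration under stated assumptions, not the DOSApplication code: the tree is given as a list of (upstream, downstream) links rooted at the target LAN, probe(link) stands for one message exchange with the active monitor on that link, and the small topology is hypothetical rather than the one in Fig. 9. The case where no DOS packet is seen at the starting hop is not described in the text and is not handled here.

```python
from collections import defaultdict


def build_children(links):
    """Map each node to the nodes one hop further away from the root."""
    children = defaultdict(list)
    for parent, child in links:
        children[parent].append(child)
    return children


def link_hops(children, root):
    """Return {(parent, child): hop}, where links attached to the root are hop 1."""
    hops = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for child in children.get(node, []):
            hops[(node, child)] = depth + 1
            stack.append((child, depth + 1))
    return hops


def locate_attack_links(links, root, probe):
    """Probe links from the middle of the tree outwards; count message exchanges."""
    children = build_children(links)
    hops = link_hops(children, root)
    max_hop = max(hops.values())          # step (i)
    start_hop = max_hop // 2 + 1          # step (ii): links next to the middle routers
    frontier = [link for link, hop in hops.items() if hop == start_hop]
    messages = 0
    found = []
    while frontier:                       # step (iii), repeated hop by hop
        next_frontier = []
        for link in frontier:
            messages += 1
            if not probe(link):
                continue
            downstream = [(link[1], c) for c in children.get(link[1], [])]
            if downstream:
                next_frontier.extend(downstream)
            else:
                found.append(link)        # outermost link carrying DOS packets
        frontier = next_frontier
    return found, messages


# Hypothetical SPF tree rooted at the target LAN "NW1".
example_links = [
    ("NW1", "IP1"), ("IP1", "IP2"), ("IP1", "IP4"), ("IP2", "IP3"),
    ("IP2", "NW2"), ("IP3", "NW3"), ("IP3", "NW4"), ("IP4", "NW5"),
]
# Links on the route from an attacker in NW3 to the target.
attack_route = {("IP3", "NW3"), ("IP2", "IP3"), ("IP1", "IP2"), ("NW1", "IP1")}
result = locate_attack_links(example_links, "NW1", lambda link: link in attack_route)
```

In this hypothetical tree the search probes three links at hop 3 and two at hop 4, compared with one probe per link for the simple algorithm.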

We have performed the experiment using the network configuration 
shown in Fig. 8. The DOS attacker was detected in about 2 seconds. 



4.2 Performance Evaluation 

The performance of packet capture and filtering has been evaluated in the 
following way: An active monitor runs on a PC with a 1.3GHz Pentium III CPU and 
512 Mbyte memory. The PC taps a FastEthernet link, and a LAN tester (IXIA) sends 
packets at a fixed rate while the packet size is varied. The ExAnalysis object 
of section 3 runs and filters the packets using source and destination IP 
addresses. The result is shown in Table 2. Each column shows the packet capture 
speed in two forms: the number of captured packets per second and the capture 
throughput. 

Table 2. Performance of Active Monitor. 

Packet Size (bytes)          64     128    256            1024 
Capture Speed (packets/s)    7400 
Capture Speed (Mbit/s)       3.78   6.25   8.21   10.12   10.16 
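
The two forms of capture speed are related by throughput = packets per second x packet size x 8 bits. The following sketch shows the conversion in both directions; it assumes that the throughput counts only the bytes of the stated packet size (no framing overhead), which the text does not state explicitly.

```python
def throughput_mbit_s(packets_per_second, packet_size_bytes):
    """Capture throughput in Mbit/s for a given packet rate and packet size."""
    return packets_per_second * packet_size_bytes * 8 / 1e6


def packets_per_second(throughput, packet_size_bytes):
    """Packet rate implied by a throughput in Mbit/s and a packet size."""
    return throughput * 1e6 / (packet_size_bytes * 8)


# 7400 packets/s of 64-byte packets, the values given for the smallest size.
smallest = throughput_mbit_s(7400, 64)
```

For 64-byte packets this gives about 3.79 Mbit/s, consistent with the 3.78 Mbit/s entry. Applied in the other direction, the 10.16 Mbit/s entry for 1024-byte packets corresponds to roughly 1240 packets per second.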



5 Discussion 

5.1 Usefulness of Active Monitor Network 

(1) Flexibility and Programmability 

The introduction of active network techniques into traffic monitors and a 
network monitoring system has proved successful. The programmability of the 
traffic monitor is so useful that the DOS attacker detection application was 
implemented in just two weeks by one Java programmer. The short turnaround time 
of network monitoring application development is one of the most important 
advantages of active monitor networks. If the application had been developed 
from scratch, it would have taken more than half a year, because the platform 
development alone took half a year. This productivity results from the 
following aspects: First, the active monitor platform provides abundant methods 
for analyzing captured packets, such as the packet filter and the get/set of 
protocol parameters. This improves the productivity of writing Analysis 
classes. For example, the ExAnalysis class is only about 80 lines long. Second, 
the client-server style communication, similar to SNMP, is simple, so it is 
easy for programmers to write a network monitoring application program. Third, 
Java itself provides high productivity. 

(2) Topology Information 

Topology information is quite useful for programmers writing an Application 
object that analyzes a monitored network in consideration of its topology. 
In the DOS attacker detection application, the topology information is 
successfully used to reduce the number of links that active monitors must 
monitor. In addition, the topology information is expected to be used by many 
network monitoring applications. For example, it can be used to calculate the 
network performance on each route. 

(3) Performance 

The performance of the active monitor is not very high. However, it is 
sufficient for many network monitoring applications. For example, the DOS 
attacker detection application of this paper works correctly as long as some 
portion of the packets is captured. In order to improve the performance, the 
bottleneck object, i.e., the Jpcap object, needs to be improved. As shown in 
Table 2, the number of packets captured per second decreases as the packet size 
increases. This decrease is due to the garbage collection of the JVM. Since the 
Jpcap object uses more memory for a larger packet than for a smaller one, 
garbage collection happens more often for larger packet sizes. Therefore, we 
plan to rewrite the Jpcap object in another programming language such as C or 
assembly language. This should reduce the number of garbage collections and 
improve the performance. 

5.2 Comparison with Related Work 

(1) Traditional Traffic Monitors 

So far, many traffic monitors have been developed. These traffic monitors 
collect statistics from captured packets. None of them is programmable; 
in other words, our active monitor is the first programmable traffic 
monitor in the literature. Moreover, by writing programs, our active monitor 
can provide any function provided by other traffic monitors. 

(2) Active Network Management System 

Many active network management systems have been developed. These 
systems make traditional manager-agent based network management systems 
programmable. Specifically, they make an agent that stores MIB information 
programmable. For example, a packet of SmartPacket contains a program for 
detecting equipment errors or recovering from a failure, and the program is 
executed on the agent, which reduces the communication between a manager and an 
agent. However, these systems focus only on intelligent handling of the MIB. 
They do not provide functions for handling traffic statistics that are not 
stored in the MIB. 

ANCORS is an adaptable network control and reporting system which 
merges network management and distributed simulation. The system provides 
user-definable network monitoring capabilities. Although the analysis of 
monitored results is programmable, the system uses only existing traffic 
information such as MIB and RMON-MIB information. 

The ANMOS monitoring tool is the first active network management system 
which focuses on network monitoring. However, this tool focuses on monitoring 
an active network itself in order to know how the active network behaves. 
Moreover, it does not make traffic monitors programmable. In contrast, our 
active monitor makes a traffic monitor programmable so that network operators 
can collect traffic information that cannot be obtained without analyzing the 
parameters and sequences of captured packets. 

(3) Active Network Platform 

An Analysis program is executed every time a captured packet is received. 
This mechanism is similar to capsule based active networks, where a program 
in a capsule is executed when the capsule is received by an active node. PLANet 
(Packet Language for Active Networks), ANTS, and SC Engine are 
examples of capsule based active networks. The implementation of our active 
monitor network is similar to ANTS; however, the objectives are different. The 
other capsule based active networks make routers programmable, whereas our 
active monitor network makes traffic monitors programmable. 

6 Conclusion 

In order to achieve a flexible network monitoring system, we propose 
introducing an active network approach into traffic monitors. We have developed 
an active monitor network which consists of programmable traffic monitors and a 
manager. We have also developed a DOS attacker detection application by writing 
an analysis program and an application program. These developments yield the 
following results. 

— The introduction of active network approach into traffic monitors is useful to 
achieve a flexible network monitoring system. The programmability of traffic 
monitors makes the productivity of a network monitoring application high. 
For example, the DOS attacker detection application has been developed in 
two weeks. 

— The introduction of topology information of a monitored IP network into a 
network monitoring system is useful to make a network monitoring system 
intelligent. For example, the DOS attacker detection application reduces 
the number of messages between a manager and active monitors using a shortest 
path first tree calculated from the topology information. 

Abstract. Active networks have been recently highlighted as a key enabling 
technology to obtain immense flexibility in terms of network deployment, 
configurability, and customized packet processing. However, this flexibility is 
often achieved at the cost of router performance. In this paper, we present a 
three-level node architecture that combines flexibility and high performance of 
network nodes. We design and implement an active network application for 
real-time speech transmissions on top of this three-level platform. In our 
application, plug-in modules are downloaded onto certain network nodes to 
monitor packet loss rate of voice streams and to perform application-specific 
packet processing when necessary. In particular, we propose to perform loss 
concealment algorithms for voice data streams at active network nodes to 
regenerate lost packets. The regenerated speech data streams are robust enough 
to tolerate further packet losses along the data path so that the concealment 
algorithms at another downstream node or at the receiver can still take effect. 
We call our approach reactive concealment for speech transmission to 
distinguish it from concealment performed at the receiver and also proactive 
schemes like Forward Error Correction. Our approach is bandwidth-efficient 
and retains the applications’ end-to-end semantics. 



1 Introduction 



We present a three-level active network node architecture to fulfil the requirements 
for flexibility without impairing the routers’ performance. The architecture 
is composed of three different levels. The fixed part contains components for 
forwarding functionality and QoS primitives. These components are optimised and 
static for performance reasons. The programmable part encompasses the 
interfaces of the fixed part and provides abstractions of the fixed part as well as an 
open interface to the higher level. The active part offers a limited execution 
environment for lightweight active code. Our three-level architecture is of particular 
interest to vendors of legacy routers, since their proprietary interfaces can be 
wrapped by a generic interface and thus integrated and migrated to an active 
network environment. A prototype of our architecture has been implemented on top of 
Hitachi’s high-speed routers GR2000. 

Based on our network node architecture, we have designed an application for real- 
time speech transmission. Recent application-level techniques like Adaptive 
Packetization and Concealment (AP/C) have demonstrated that the perceived speech 
quality can be improved by exploiting speech stationarity [Sann98]. However, because 
AP/C exploits the property of speech stationarity, its applicability is typically limited 
to isolated, i.e. non-consecutive, losses. When the rate of burst losses is high, AP/C 
does not achieve any significant performance improvement compared to other 
techniques. Furthermore, it has been shown that speech quality drops significantly 
when burst losses occur [GS85]. We believe that this is the point where 
flexibility provided by active network nodes can be exploited to help applications at 
end systems to perform better. Our node architecture offers sufficient flexibility for 
programming hardware components via a generic interface to perform application-specific 
tasks. These tasks include monitoring RTP streams to measure the packet loss 
rate and enforcing QoS primitives implemented in hardware to give these streams 
higher priority only when necessary. When the packet loss rate exceeds a certain 
threshold, a plug-in module of concealment algorithms is downloaded and executed 
at certain network nodes to regenerate lost packets and to inject them into voice 
streams. The network nodes are programmed via an open interface to perform 
application-specific processing in software only for the specified voice streams. Other 
packets passing through the network nodes are forwarded in hardware to retain high 
performance. 

The rest of this paper is structured as follows. In section 2 we briefly review 
related work. Section 3 presents our three-level active node architecture that we 
developed within the BANG project [BANG]. We also discuss the AP/C algorithm that 
we download to the active network nodes to perform loss concealment for voice 
streams as an application on top of our three-level active network platform. We then 
present our approach of placing active network nodes at certain locations within the 
network to improve the effectiveness of the receiver’s concealment. Section 
4 shows the results of a simulation study to evaluate the efficiency of our approach. In 
section 5 we describe the prototype implementation of our three-level active network 
node architecture. Finally, in section 6 we give conclusions of our work and outline 
potential future work areas. 



2 Related Work 

The concept of active networks allows users to execute code on network nodes to 
meet their application-specific requirements. Another concept is to have well-defined 
open interfaces to network nodes and to separate their internal states from signaling 
and management mechanisms. This concept proposes the programmability of network 
nodes and thus reduces the operations that users are allowed to perform inside the 
network. A survey of research projects on active and programmable networks can be 
found in [TSS97] and [CKVV99]. 

Applying the concepts of active and programmable networks, application-level 
performance can be improved significantly thanks to the network nodes’ 
application-specific packet processing. This is especially true for multimedia data that has a 
specific flow structure. Typical examples of application-specific packet processing at 
network nodes are media transcoding [AMZ95], media scaling [KCD00], packet 
filtering [BET98], or discarding [BCZ97] for video distribution on heterogeneous 
networks with limited bandwidth. Surprisingly, there are very few active network 
projects that exploit the active nodes’ capability of application-specific packet 
processing to improve the quality of Internet voice or audio transmissions. The work we 
are aware of is [BET98], [FMSB98], [MHMS98]. In [BET98] active nodes add an 
optimal amount of redundant data on a per-link basis to protect audio streams against 
packet loss. [FMSB98] and [MHMS98] describe the use of so-called protocol boosters 
that allow dynamic protocol customization to heterogeneous environments within the 
network. A protocol booster can run on an active node and perform forward error 
correction to improve transmission quality over a lossy link. Similar concepts, 
although not explicitly targeted at active networks, are described in [BKGM00]. 

Since most packet losses on the Internet are due to congestion (except for wireless 
networks), we argue that it is not the most efficient method to transmit redundant data 
on a link that is already congested. We propose an approach where application- 
specific packet processing is performed at an uncongested active node to regenerate 
audio packets lost due to congestion at congested upstream nodes. Furthermore, the 
efficiency of our algorithm for lost packets’ regeneration can be significantly 
improved when it is combined with other programmable modules. These modules 
include DiffServ and monitoring services that are implemented within our 
architectural framework. Other programmable modules are being designed and 
implemented. 

Following the concept of programmable networks, the IEEE Project 1520 [P1520] 
is an effort to standardize programming interfaces for network nodes. It defines a 
structure of four layers with open interfaces for network nodes: physical element level 
(PE), virtual network device level (VNDL), network generic services level (NGSL), 
and value-added services level (VASL). The interfaces of these four levels are called 
the CCM (Connection Control and Management), L (lower), U (upper), and V (value-added) 
interfaces. Recently, it has been proposed to split the L-interface into a generic 
abstraction interface (L-) and a service specific abstraction interface (L+). The 
architecture presented in this paper is related to the P1520 framework proposed in 
[LDVS99]. The P1520 interfaces described in [RBWYK00], [BVKV00] map quite 
well to our architecture. Currently, further development of our architecture is being 
carried out within the FAIN project [BANG], [FAIN], [GPLD00]. 



3 A Three-Level Active Network Node Architecture 



Despite the immense flexibility gain, implementing low-level forwarding and QoS 
primitives within active components does not seem to be realistic in the foreseeable 
future due to performance constraints. This consideration leads to the introduction of 
our three levels. The three-level architecture achieves the necessary flexibility without 
impairing the router’s performance. The key design decision of our architecture is to de-couple 
the control software from the forwarding functionality implemented in hardware. The 
fixed level of our architecture contains static and optimized forwarding components 
with QoS primitives that cannot be made programmable for performance reasons. 
In our node architecture, data packets, which make up the majority of packets 
flowing through a network node, are processed directly by hardware in the fixed level. 
The programmable level exploits the primitives and high performance of the fixed 
part to provide end-to-end services with an open interface. Control packets without 
application-specific requirements are serviced by generic network mechanisms in the 
programmable level. In order to provide a good perceived quality, multimedia 
applications typically require that a certain end-to-end quality of service is 
guaranteed. The programmable part can fulfil the requirements of quality of service 
by enforcing Differentiated Services primitives [BBC98] in a co-ordinated way 
between network nodes. Monitoring of streams’ packet loss rate is another important 
service of the programmable level. 

Exploiting the monitoring service, the application level can download and execute 
plug-in modules only on necessary network nodes. In our architecture, services of the 
programmable level are implemented on top of the fixed part. The node-local 
interface to the fixed part is implemented by establishing an automated telnet 
connection to Hitachi’s gigabit routers to perform the necessary router configuration. 
By doing this, we separate the data path and the control path to avoid performance 
loss. The programmable level is in turn controlled by the active level via an open 
interface. The active level offers a limited execution environment for lightweight 
active code. Lightweight active code usually implements application-specific 
algorithms and uses the module interfaces of the programmable part to 
access the functionality of the fixed part implemented in hardware. It 
typically contains function calls or simple scripts addressed to the module interfaces of the 
programmable part with specific parameters. Other active code contains simple 
instructions for downloading programmable modules onto network nodes. After these 
modules are downloaded, subsequent active code carries along only the parameters 
necessary to configure these modules. 



3.1 Mapping of P1520 Interfaces to the Three-Level Active Node Architecture 

A network node in the three-level active node architecture consists of a hardware- 
dependent and a hardware-independent part. The main idea here is to develop a 
framework for programmable routers via a generic interface. However, this generic 
framework exploits hardware-specific features (such as the QoS primitives of the Hitachi 
GR2000 gigabit router) to achieve high performance. Thus, the network node’s 
programmability is generic while active code can still exploit the network node’s 
specific features. 

Our architecture is closely related to the proposed P1520 framework 
[LDVS99]. Fig. 1 shows a view of the three-level active node architecture in 
relation to the P1520 interface specifications. In our architecture, an automated telnet 
connection between the GR2000 router and a PC router controller is the interface 
between the fixed and the programmable level. The GR2000 on one side of the telnet 
connection forms the active node’s hardware-dependent part. The PC router controller 
is a Linux box running the active node’s software on the other side. It furthermore 
offers limited execution environments for plug-in modules that could otherwise burden 
the GR2000 resources. The telnet interface is roughly equivalent to the CCM 
interface of P1520. On the hardware-dependent side, it is accessed by a telnet C 
library that enables the configuration of the GR2000 hardware’s QoS and filtering 
primitives. This library is in turn wrapped by a Java library to shield active code 
from hardware and operating system dependencies. The GR2000 Java interface is 
used to configure low-level router primitives. This interface can be placed 
at the L- level because it abstracts from the device-specific GR2000 interface, yet it 
does not completely fit into the P1520 framework. On top of the L- interface, higher 
level QoS modules are implemented. These modules first abstract from specific 
devices and offer an interface to software architectures at the L+ level. They have 
purely local as well as service-specific semantics. 



[Figure: recoverable labels: Limited Execution; QoS Configuration; Filter Configuration.] 

Fig. 1. Three-Level Active Node Architecture. 

Active network applications trigger the installation of active components using the 
end-to-end V interface. The installation consists of transporting the active components 
to the remote active nodes using a mobile agent. The transport mechanism is 
implemented by a Java mobile agent platform that was modified for this special 
purpose. Application designers are shielded from implementation details of the 
mobile agent platform and only have access to a network device via an abstract 
interface. The virtual network device level realizes the code transport mechanisms to 
the local node and limits the access to local resources of the GR2000 routers as well 
as the PC router controller. Thus, the L interface can be used by application designers 
to access resources of the GR2000 routers and the PC router controller without 
detailed knowledge about code transport mechanisms or router configurations. Under 
the control of a host manager the active components are installed and executed. At the 
U level network-wide services can be accessed by active code. Finally, interfaces at 
the V level allow applications to combine services of lower levels. For example, an 
application can configure and access the meter modules of programmable nodes to 
estimate the packet loss rate along the path of a multimedia stream. Having this 
information, it can either enforce the routers’ QoS capability at the bottleneck link or 
download and perform a concealment algorithm at another programmable network 
node downstream of the bottleneck link. 



3.2 DiffServ Module 

To achieve interoperability with standard QoS control mechanisms, programmable 
routers’ interfaces allow resource control by traditional QoS control as well as by 
components dynamically deployed by active networks. The three-level active node 
architecture is designed to offer such an open QoS control interface. Fig. 2 shows the 
investigated architecture. It is based on our three-level active node architecture, 
shown on the right side of the figure, but extends the proprietary GR2000 router 
interface with a standardized DiffServ interface for QoS control specified at the L+ 
level. 



[Figure: recoverable labels: QoS Control Application; DiffServ; BANG Platform; Netlink Sockets (Linux QoS); Mobile Agent Code Transport; GR2000 C-Interface; Linux kernel; GR2000; Router Controller.] 

Fig. 2. Three-Level Active Node Architecture and DiffServ Interface. 



Instead of accessing the GR2000 via a proprietary, vendor-specific 
interface, active components control the router via a DiffServ API. The DiffServ API 
also enables non-active QoS control applications to configure a GR2000 router. On 
the one hand, this API allows an easy realization of the proposed architecture. On the 
other hand, it allows future porting of the architecture to other routers. Comparing our 
architecture with the P1520 interface proposal, the L interface is divided into two 
sub-layers: L+ providing a standardized QoS control interface and L- providing a mapping 
from the DiffServ layer to the router-specific API. 

This DiffServ interface definition was implemented using the C programming 
language. The DiffServ C-interface allows programs to control the router by setting 
and deleting the DiffServ structures defined by the L+ interface. Because active code 
is implemented in Java within our three-level active node architecture, it is necessary 
to wrap the developed C-interface by a Java interface providing similar functionality, 
i.e. to implement the L+ IDL definition in Java. Thus, legacy programs can use the 
C-interface whereas active programs may use the Java-interface. 

Fig. 3. DiffServ Module Structure. 



Fig. 3 shows the internal layering of the DiffServ module. Note that besides access 
via the L+ interface, it is also possible to use other lower interface levels. In addition 
to the P1520 L- interface, intermediate levels also exist, namely the GR2000 C (or 
Java) interface and the ‘tc’ tool, which is a well-known traffic control tool to configure 
the Linux QoS interface. The ‘tc’ tool is implemented on top of the Linux netlink 
interface, which can be considered equivalent to the CCM interface. The DiffServ 
implementation presented in this paper realizes a standardized interface for QoS 
control on Hitachi’s GR2000 router. The DiffServ interface is implemented in Java 
and C for providing access to traditional as well as active management components. 
Section 5 will demonstrate how active networks can be used to provide QoS to 
multimedia applications by autonomous components deployed in case of network 
congestion. 



3.3 Active Meter Module 

In traditional metering systems, data is gathered by distributed meters in the network. 
The metering data is collected by readers and transferred to some management entity 
on an end system. The RTFM architecture [RFC2063] is an example of such a 
traditional system. In an active network we have to deal with two main additional 
issues. First, active networks enable the rapid deployment of new protocols and 
services within the network. This means that even end users might be able to 
introduce new protocols and services. Active metering must deal with these new 
protocols and services. Therefore an active meter must be flexible and extensible. In 
the extreme case that the user is able to deploy a new protocol, it must be possible to 
meter this protocol for testing, accounting and management purposes. 

The second issue arises from the possibility of having active code running on the 
active nodes. If this active code is performing active QoS enforcement or 
improvement (Protocol Enhancing Proxies (PEPs) [BKGM00] or Protocol Boosters 
[FMSB98]), it needs access to local metering data. One example of a PEP is the 
Adaptive Packetization and Loss Concealment (see section 3.4) method, which 
improves Internet voice transmission in the case of packet loss. A PEP needs access to 
metering data for two reasons. Firstly, the PEP must decide when to become active. In 
the case of no loss or very high loss rates, activation of AP/C does not improve the 
voice quality but instead introduces additional delay and wastes CPU and memory 
resources on the active node. Secondly, the PEP operation might be parameterized 
through the metrics measured. For example, Forward Error Correction (FEC) can be 
used to improve the quality on links with packet loss. To avoid wasting resources, the 
amount of redundancy generated must depend on the currently measured loss rate. 
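
The two uses of metering data can be illustrated with a small decision sketch. The thresholds (1% and 20%), the proportional redundancy rule, and the function names are assumptions chosen for illustration only; the text gives no concrete values.

```python
def should_activate_concealment(loss_rate, low=0.01, high=0.20):
    """Activate AP/C only for moderate loss; loss_rate is a fraction in [0, 1].

    The low and high thresholds are hypothetical.
    """
    return low <= loss_rate <= high


def fec_redundancy(loss_rate, factor=2.0, max_redundancy=0.5):
    """Fraction of redundant data to add, growing with the measured loss rate.

    The linear rule and the cap are hypothetical stand-ins for a real policy.
    """
    return min(max_redundancy, factor * loss_rate)


# Decisions for a stream with 5% measured loss.
activate = should_activate_concealment(0.05)
redundancy = fec_redundancy(0.05)
```

The point of the sketch is only that both decisions take the currently measured loss rate as input, so the PEP has to be able to read it from the local meter.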

General requirements for a meter system are speed and efficiency. This means that 
the effort for metering should be low compared to the effort of packet forwarding. 
Active metering should have a minimal impact on the performance of the active node. 
Therefore we propose native code modules as the most promising tradeoff between 
flexibility and performance. The selection of programmable modules for a certain task 
is specified in a rule set language. This approach also supports a heterogeneous 
infrastructure assuming that we can provide native modules with the same function 
for each device. In that sense the meter is active because it can be dynamically 
extended and enhanced with new modules providing the needed functionality. 

The design of the active meter maps to the P1520 architecture described in 
[LDVS99]. For the L- layer it was decided to use Linux NetFilter [Russ00]. Fig. 4 
depicts the architecture in relation to the P1520 layers. The basis of the architecture 
is the Linux kernel. The CCM interface separates the NetFilter code from the rest of 
the kernel code. It is not a clear-cut interface but rather consists of a number of function 
calls. On top of the CCM interface is the NetFilter code. NetFilter is controlled via a 
userspace tool called iptables. This tool represents the L- interface to the NetFilter 
classification functionality. On top of the L- interface, the meter core consists mainly 
of a flow table and a ruleset manager. The control interface allows the management of 
the ruleset within the meter. Rules can be added, modified or deleted. For defining 
rulesets a simple ruleset language has been defined. The data interface allows access 
to the meter data. Based on the flow specification the metered data like byte and 
packet counter or the current packet loss rate can be retrieved. The control interface 
and the data interface together form the L+ meter interface. The active meter has been 
implemented in C. 
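
The meter core described above (a ruleset plus a flow table with byte counters, packet counters, and a loss rate) can be sketched as follows. This is an illustration in Python, not the C implementation: the class and method names are hypothetical, and estimating loss from gaps in packet sequence numbers is an assumption; the sketch ignores sequence number wrap-around and reordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowRecord:
    packets: int = 0
    bytes: int = 0
    first_seq: Optional[int] = None
    highest_seq: Optional[int] = None


class MeterCoreSketch:
    """Hypothetical meter core: control interface (rules) and data interface."""

    def __init__(self):
        self.rules = set()   # flow specifications to be metered
        self.flows = {}      # flow table: flow specification -> FlowRecord

    # Control interface: manage the ruleset.
    def add_rule(self, flow_spec):
        self.rules.add(flow_spec)

    def delete_rule(self, flow_spec):
        self.rules.discard(flow_spec)
        self.flows.pop(flow_spec, None)

    def on_packet(self, flow_spec, size, seq):
        """Update the counters of a flow that matches an installed rule."""
        if flow_spec not in self.rules:
            return
        record = self.flows.setdefault(flow_spec, FlowRecord())
        record.packets += 1
        record.bytes += size
        if record.first_seq is None:
            record.first_seq = seq
        if record.highest_seq is None or seq > record.highest_seq:
            record.highest_seq = seq

    # Data interface: retrieve metered data by flow specification.
    def counters(self, flow_spec):
        record = self.flows.get(flow_spec, FlowRecord())
        return record.packets, record.bytes

    def loss_rate(self, flow_spec):
        record = self.flows.get(flow_spec)
        if record is None or record.packets == 0:
            return 0.0
        expected = record.highest_seq - record.first_seq + 1
        return max(0.0, 1.0 - record.packets / expected)
```

A flow specification is treated here as any hashable value, for example a tuple of source address, destination address, and ports.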





[Figure: layers from top to bottom: Meter Control Interface and Meter Data Interface; Meter Core (Flow Table, Ruleset); NetFilter Kernel Modules.] 

Fig. 4. Active Meter Module Structure. 



The loss rate measurement functionality of the active meter can be exploited by 
modules performing active QoS enhancements. In this paper we propose the Adaptive 
Packetization and Concealment (AP/C) algorithm for improving voice transmission 
over links with moderate loss rates. Since the GR2000 is not capable of metering at 
flow-level granularity, the active meter resides on the router controller. A small set of 
rules is enforced in the GR2000’s hardware. Packets that match the rules are passed to 
the router controller for metering purposes. Other packets are forwarded by the 
GR2000 as usual. The router controller meters the packets passed by the GR2000 and 
routes them back to the GR2000 via another network interface to avoid packet 
looping. Active code can access the meter module to set rules and retrieve metering 
data via an open interface. For example, a mobile agent can be used to set up or 
change the meter rule set or to read data and transport it to another system. 



3.4 Loss Concealment Module 

AP/C (Adaptive Packetization / Concealment) exploits the speech properties to 
influence the packet size at the sender and to conceal the packet loss at the receiver. 
The novelty of AP/C is that it takes the phase of speech signals into account when the 
data is packetized at the sender. AP/C assumes that most packet losses are isolated. In 
AP/C, the receiver conceals the loss of a packet by filling the gap of the lost packet 
with data samples from its adjacent packets. Regeneration of lost packets with 
sender-supported pre-processing works reasonably well for voiced sounds thanks to their 
quasi-stationary property. Regeneration of lost packets works less well for unvoiced 
sounds due to their random nature. However, this is not necessarily critical because 
unvoiced sounds are less important to the perceptual quality than voiced signals. 
Since the phase of the speech signal is taken into account when audio data is 
packetized, fewer discontinuities are present in the reconstructed signal than with 
conventional concealment algorithms. 

Since AP/C assumes that most packet losses are isolated, it does not obtain any 
significant performance improvement compared to other techniques when the rate of 
burst losses is high. We believe that this is the point where the active nodes’ 
capability of application-specific packet processing can be exploited to help 
applications at end systems perform better. Since the burst loss rate of a data flow at a 
network node is lower than at the receiver, the AP/C concealment algorithm works 
more efficiently and more lost packets can be reconstructed when concealment is 
performed within the network rather than just at end systems. We thus propose to 
download and perform the AP/C concealment algorithm at certain active nodes where 
the number of burst losses of a voice data stream is sufficiently low to regenerate the 
lost packets. The regenerated audio stream is robust enough to tolerate further packet 
losses so that the AP/C concealment algorithm can still take effect at another 
downstream active node or at the receiver. 
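The claim that burst losses are rarer inside the network than at the receiver can be checked with a short calculation. The sketch below assumes independent (Bernoulli) packet loss, which is an assumption made here for illustration: if each packet is lost with probability q, a lost packet is isolated (both neighbours arrive) with probability (1 - q)^2. The function names and the example value 0.12 are illustrative.

```python
def end_to_end_loss(p1: float, p2: float) -> float:
    """Loss rate seen after two independent lossy path segments."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)


def isolated_fraction(q: float) -> float:
    """Probability that a lost packet has both neighbours intact,
    when every packet is lost independently with probability q."""
    return (1.0 - q) ** 2


# Assumed example: both path segments drop packets with probability 0.12.
p1 = p2 = 0.12
at_node = isolated_fraction(p1)                           # active node after segment 1
at_receiver = isolated_fraction(end_to_end_loss(p1, p2))  # receiver, no active node
print(f"isolated losses at the active node: {at_node:.3f}")
print(f"isolated losses at the receiver:    {at_receiver:.3f}")
```

Under these assumptions roughly 77% of the losses seen at the active node are isolated and therefore candidates for AP/C concealment, against about 60% of the losses seen at the receiver.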

The idea of the active network application is demonstrated in Fig. 5. The AP/C 
sender algorithm is performed to packetize audio data taking the phase of speech 
signals into account. Along the data path, packets 2 and 4 are lost. Exploiting the 
sender’s pre-processing, the AP/C concealment algorithm is applied at an active node 
within the network to reconstruct these lost packets. Downstream of the active node, 
another packet is lost (packet 3) which is easily reconstructed at the receiver. In this 
scenario, active concealment reconstructs six lost “chunks” (a chunk is a logical unit 
of speech identified by the AP/C sender algorithm; in Fig. 5 they are designated by 
$C_{21}$, $C_{22}$, $C_{31}$, $C_{32}$, $C_{41}$, and $C_{42}$) and clearly outperforms the receiver concealment 
[Sann98] which can only reconstruct at most two chunks ($C_{21}$ and $C_{42}$) due to the burst 
loss accumulated along the end-to-end data path. 

Fig. 5. Active Concealment. 



Thanks to the flexibility offered by active nodes, the plug-in module for concealment 
can operate in a spectrum from the observer mode (only regenerate lost packets) to 
the proxy mode (buffer or recover multiple packets and then forward them). Observer 
mode consumes less CPU and memory resources of an active node but proxy mode is 
more powerful. 

Fig. 6. Active Nodes in Observer Mode. 

In the observer mode, an active node buffers only the packet with the highest RTP 
sequence number of a flow. When the gap in the RTP sequence number between a 
new packet and the currently buffered packet is larger than one, the active node 
assumes that a burst loss has occurred upstream. It throws away the old packet and 
buffers the new one. The underlying assumption here is that out-of-order packets are 
rare. An active node delays the current packet until the lost packet has been 
reconstructed with AP/C to avoid reordering. The advantage is that downstream 
active nodes can avoid duplicate concealment. Duplicate concealment would happen 
if a single packet is lost and every active node along the path attempts to conceal the 
lost packet when it sees the previous and the next packet. Fig. 6 shows a scenario with 
two active network nodes in the observer mode. Since the lost packet is reconstructed 
from its previous and next packet, it suffers an additional delay. This additional delay 
is exactly the delay between two subsequent packets. The additional delay incurred by 
packet regeneration is demonstrated in Fig. 7. Let $d$ be this delay, $n$ be the number of 
times a packet can be regenerated, and $D$ be the maximum playout budget¹ of the 
receiver. We have the constraint $n \cdot d < D$. Thus, the number of active nodes along the 
path of a voice flow should not exceed $\lfloor D/d \rfloor$. Otherwise, packets regenerated 
more than $\lfloor D/d \rfloor$ times are discarded by the receiver because they arrive later than 
their playout time. Consider an interactive voice application of two users between 
Chicago and Paris. Interactive voice applications require that the one-way delay be 
smaller than 250 ms. The distance between Chicago and Paris is 4142 miles, which 
translates into an approximate propagation delay of 4142 * 1600 / 300000000 s ≈ 22 
ms (taking 1600 m per mile and a propagation speed of 3 * 10^8 m/s). With a queueing 
delay of 35 ms, the maximum playout delay is D = 250 - 22 - 35 = 193 ms. Depending 
on the speaker’s voice, the average AP/C packet size ranges from 
80 to 160 samples [Sann98]. In this example, we choose a packet size of 120 samples 
and obtain a packetization delay of 120 / 8000 = 0.015 s = 15 ms. Thus, the maximum 
number of active nodes on a voice path between Chicago and Paris is $\lfloor 193/15 \rfloor = 12$. 

Fig. 7. Additional Delay Incurred by Packet Regeneration. 
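The observer-mode decision and the delay budget above can be sketched in a few lines. This is an illustration under stated assumptions, not the prototype's code: the function names are invented, "gap" is interpreted as the number of missing packets between the buffered and the new packet, late or duplicate packets are simply ignored (the text assumes they are rare), and RTP sequence-number wrap-around is not handled.

```python
import math


def observer_action(buffered_seq: int, new_seq: int) -> str:
    """Decide what an observer-mode node does with a newly arrived packet."""
    missing = new_seq - buffered_seq - 1
    if missing < 0:
        return "ignore"      # late or duplicate packet (assumed rare)
    if missing == 0:
        return "forward"     # in sequence, nothing to conceal
    if missing == 1:
        return "regenerate"  # isolated loss: rebuild it from its two neighbours
    return "skip"            # burst loss upstream: drop the old packet, buffer the new one


def max_regenerations(playout_budget_ms: float, packet_delay_ms: float) -> int:
    """Bound floor(D / d) on the number of regenerations, as used in the text."""
    return math.floor(playout_budget_ms / packet_delay_ms)


# Chicago-Paris example with the figures from the text.
propagation_ms = round(4142 * 1600 / 300_000_000 * 1000)  # about 22 ms
queueing_ms = 35
budget_ms = 250 - propagation_ms - queueing_ms            # D
packet_delay_ms = 1000 * 120 / 8000                       # d for 120 samples at 8 kHz
print(observer_action(7, 9))
print(budget_ms, packet_delay_ms, max_regenerations(budget_ms, packet_delay_ms))
```

With these inputs the budget is 193 ms, the packetization delay is 15 ms, and the bound evaluates to 12, matching the example.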



In the proxy mode, an active node buffers more than one packet of a voice flow. 
An active node can determine how long packets are buffered for the concealment 
operation. The longer packets are buffered at an active node, the more memory is 
consumed but the higher the chance of a successful reconstruction is. This trade-off is 
similar to that of a receiver’s playout buffer. When a loss gap of one packet is 
detected, the lost packet is regenerated as described in the observer mode. Upon 
detecting a burst loss larger than one packet, an active node requests its upstream 
node to retransmit the lost packets. Fig. 8 illustrates a scenario where packets n and 
n+1 are lost and retransmitted by an active node operating in proxy mode. Since the 
proxy mode can cope with burst loss, it outperforms the observer mode. An active 
node limits the number of voice packets kept in its buffer and replaces the old packets 
by the new ones. An active node can also periodically send its upstream neighbor an 
active packet acknowledging voice packets up to a certain RTP sequence number. An 
acknowledgement does not mean that an active node has received and forwarded all 
packets up to the specified RTP sequence number. It rather means that the active node 
does not need those packets any more. This acknowledgement mechanism helps limit 
the number of buffered packets at an active node. Let $k$ be the maximum recoverable 
burst loss and $rtt_i$ be the round-trip time between the $i$-th active node and its upstream 
neighbor (in the proxy mode, the sender also participates in the ARQ process and is 
considered the 0th active node). We have the constraint 

$$n \cdot (k+1) \cdot d + rtt_1 + rtt_2 + \dots + rtt_n = n \cdot (k+1) \cdot d + rtt < D$$

where $rtt$ is the round-trip time between the sender and the receiver. This inequality 
presents a trade-off between the number of active nodes on the path of a 
voice flow and the maximum burst loss we wish to recover. Similar to the example of 
the observer mode, we have $n \cdot (k+1) \cdot 15 + (22+35) \cdot 2 < 193$. Thus, 
$n \cdot (k+1) \le \lfloor 79/15 \rfloor = 5$. If we have only one active node operating in proxy mode on a voice path 
between Chicago and Paris, we can recover a burst loss of up to four packets. 

¹ Playout budget is the maximum amount of time a packet can be kept in the receiver’s playout 
buffer. 

Fig. 8. Active Nodes in Proxy Mode. 
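The proxy-mode trade-off between the number of active nodes n and the recoverable burst length k can be tabulated directly from the constraint n*(k+1)*d + rtt < D. The helper below is a sketch with an invented name; it follows the text in using the floor of (D - rtt)/d as the bound on n*(k+1), and it returns -1 when even k = 0 does not fit.

```python
import math


def recoverable_burst(nodes: int, budget_ms: float, rtt_ms: float,
                      packet_delay_ms: float) -> int:
    """Largest k with nodes * (k + 1) <= floor((D - rtt) / d); -1 if none exists."""
    if nodes < 1:
        raise ValueError("at least one active node is required")
    slots = math.floor((budget_ms - rtt_ms) / packet_delay_ms)  # bound on n * (k + 1)
    return slots // nodes - 1


# Figures from the Chicago-Paris example: D = 193 ms, d = 15 ms,
# rtt = 2 * (22 + 35) ms = 114 ms.
budget_ms, packet_delay_ms = 193, 15
rtt_ms = 2 * (22 + 35)
for nodes in (1, 2, 5, 6):
    k = recoverable_burst(nodes, budget_ms, rtt_ms, packet_delay_ms)
    print(f"n = {nodes}: k = {k}")
```

For a single proxy node this gives k = 4, as in the text; with two nodes the recoverable burst shrinks to one packet, and with six nodes the budget is exhausted.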

Our approach is similar to Robust Multicast Audio (RMA) proposed by Banchs et 
al. in [BET98], but it acts reactively upon detection of packet loss in audio data 
streams. In contrast to RMA, which transmits redundant data on a per-link basis to 
protect audio streams against packet loss proactively, our approach simply 
regenerates the lost packets and injects them into the audio streams and is thus more 
bandwidth-efficient. Another advantage of our approach is that it does not break the 
applications’ end-to-end semantics and places no further demands on the 
number and location of active nodes performing the concealment algorithm². RMA, 
however, requires active nodes to be located at both ends of a link or a network to 
perform the FEC encode and decode operations. The concept of protocol boosters 
[FMSB98], [MHMS98] is another approach similar to ours. However, in the case of 
FEC, which is presented in [MHMS98], this approach also requires at least two 
instances of the same booster type within the network in order to perform the encode 
and decode operations in a proactive way. In contrast, our scheme is reactive and 
only acts when necessary. 

² Clearly, the number and location of active network nodes influence the performance 
improvement. However, the applications’ functionality is not affected under any 
circumstances. 



4 Simulation Study for Active Concealment 

In our simulation, we assume that there is only one active node in the path from the 
sender to the receiver where intra-network regeneration of lost packets can be 
performed. The logical network topology for our simulation is shown in Fig. 9 where 
a lossy network can consist of multiple physical networks comprising several network 
hops. We use the Bernoulli model to simulate the individual loss characteristics of the 
networks. Objective quality measurements such as [ITU98] and [YKY99] are used to 
evaluate the speech quality. These measurements employ mathematical models of the 
human auditory system to estimate the perceptual distance between an original and a 
distorted signal³. Thus, they yield results that correlate well and have a linear 
relationship with the results of subjective tests. We apply the Enhanced Modified 
Bark Spectral Distortion (EMBSD) method [YKY99] to estimate the perceptual 
distortion between the original and the reconstructed signal. The higher the perceptual 
distortion is, the worse the obtained speech signal at the receiver is. The MNB scheme 
[ITU98], though showing high correlation with subjective testing, is not used because 
it does not take into account speech segments with energy lower than certain 
thresholds when speech distortion is estimated. In the MNB scheme, the replacement 
of a lost speech segment by a silent segment does not lead to a degradation of quality, 
because this segment is not taken into account when the perceptual distortion is 
computed. 

Fig. 9. Simulation Topology. 

This section is organized as follows. In the first simulation step, we 
use the same parameter sets for the lossy networks. We then compare the speech 
quality obtained by the active loss concealment with two reference schemes. In the 
second simulation step, we vary the parameter sets of the lossy networks and measure 
the efficiency of the active loss concealment. The parameter sets are chosen in such a 
way that the packet loss rate observed at the receiver is constant. This simulation step 
is performed to determine the optimal location of the active node where the plug-in 
module for the active concealment algorithm can be downloaded and performed. 




4.1 Performance Comparison to Reference Schemes 

In this simulation step, we compare the speech quality obtained by active loss 
concealment with two reference schemes. In the first reference scheme, the sender 
transmits voice packets with constant size and the receiver simply replaces data of a 
lost packet by a silent segment with the same length. Each packet in this scheme 
contains 125 speech samples, resulting in the same total number of packets as the 
second reference scheme and the active loss concealment scheme. The second 
reference scheme is the AP/C scheme applied only at end systems. Packets are sent 
through two lossy network clouds and are dropped with the same packet drop 
probability. The parameters used in this simulation step and the resulting packet loss 
rate are shown in Table 1. 

³ We use a speech sample that consists of different male and female voices and has a length of 
25 s. 



Table 1. Parameters and Packet Loss Rate Used in Simulation for Performance Comparison 

Packet drop probability    0.03      0.06      0.09      0.12 
Packet loss rate           0.0592    0.1164    0.1720    0.2257 



Fig. 10. Performance Comparison to Reference Schemes (Simulation Step 1). 

Fig. 10 shows the results of this simulation step, plotting the perceptual distortion 
measured by EMBSD versus the network clouds’ packet drop probability. The MOS 
(Mean Opinion Score) axis helps the reader to interpret the results in terms of 
subjective quality measurement. A MOS value of 5 indicates excellent speech quality 
while a MOS value of 1 stands for an unacceptable quality. The results demonstrate 
that the higher the packet drop probability is, the higher the perceptual distortion of 
the schemes and thus the worse the speech quality is. AP/C performs better than 
reference scheme 1 that replaces lost packets by silent segments, and the active loss 
concealment obtains the best speech quality. When the network clouds’ packet drop 
probability is low, the active loss concealment does not gain any significant 
improvement compared to the AP/C scheme. This is because AP/C performs 
sufficiently well when the network loss rate is low and the number of burst losses is 
negligible. However, when the packet drop probability rises and the burst loss rate is 
no longer negligible, the perceptual distortion obtained with AP/C increases 
significantly and the active loss concealment achieves obvious improvement 
compared to AP/C. 


4.2 Optimal Active Network Node Location 

In this simulation step, we vary the parameters of the lossy network clouds to 
determine the optimal location of the active network node. This simulation step is 
intended to help answer the following question: “Given that there are the same loss 
characteristics along the data path, where is the most effective location to download 
and perform the active concealment algorithm?” 

The packet loss rate $p$ of a data path consisting of two network clouds with packet 
drop probabilities $p_1$ and $p_2$ is given by $p = 1 - (1 - p_1)(1 - p_2)$. 
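Since this simulation step keeps the end-to-end loss rate constant while moving the active node, the drop probability of the second cloud follows from the first by inverting the formula above. The sketch below shows that inversion and the balanced split where both clouds drop with the same probability; the function names and the example value p = 0.12 are chosen here for illustration and are not taken from the simulation code.

```python
import math


def end_to_end_loss(p1: float, p2: float) -> float:
    """p = 1 - (1 - p1)(1 - p2) for two independent lossy clouds."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)


def second_cloud_drop(p: float, p1: float) -> float:
    """Drop probability p2 that keeps the end-to-end loss rate at p for a given p1."""
    if not 0.0 <= p1 <= p < 1.0:
        raise ValueError("need 0 <= p1 <= p < 1")
    return 1.0 - (1.0 - p) / (1.0 - p1)


def balanced_drop(p: float) -> float:
    """Drop probability with p1 == p2 that yields the end-to-end loss rate p."""
    return 1.0 - math.sqrt(1.0 - p)


p = 0.12  # assumed constant end-to-end loss rate
for p1 in (0.0, 0.03, balanced_drop(p), 0.09, 0.12):
    p2 = second_cloud_drop(p, p1)
    print(f"p1={p1:.4f}  p2={p2:.4f}  p={end_to_end_loss(p1, p2):.4f}")
```

For equal drop probabilities of 0.06 the formula gives 0.1164, which agrees with the corresponding entry in Table 1.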

The result of this simulation step is presented in Fig. 11 using EMBSD to compute 
the perceptual distortion of the obtained speech signal at the receiver. It shows that the 
optimal location to download and perform the active concealment algorithm is where 
the packet loss rate from the sender to that location is equal to the packet loss rate 
from there to the receiver ($p_1 = p_2$). On the one hand, if the packet loss rate from the 
sender to the location of the active node is too high ($p_1 \gg p_2$), the active concealment 
algorithm cannot exploit its advantage in terms of the location as compared to 
concealment just at the receiver. On the other hand, if the packet loss rate from the 
active node to the receiver is too high ($p_1 \ll p_2$), the concealment algorithm at the 
active node is idle most of the time, because the majority of losses happen at 
subsequent network nodes. This effect is increasingly important when the packet loss 
rate (and thus the packet drop probability) increases, leading to a higher number of 
burst losses which causes the “conventional” concealment algorithm to fail. 




[Figure: perceptual distortion (EMBSD) plotted against $p_1$, with one curve each for 
p = 0.03, 0.06, 0.09, 0.12, and 0.15] 

Fig. 11. Optimal Active Network Node Location (Simulation Step 2). 



5 Prototype Implementation 

We have set up a testbed for demonstrating, experimenting with, and 
implementing applications that can exploit the advantages of active network 
technologies. The heart of the testbed is a set of three GR2000 routers provided by 
Hitachi. These three high-speed routers meet the high bandwidth requirements of 
novel applications and also provide primitives for QoS and filtering. 
Each of these routers is connected to a PC router controller that controls the GR2000 
via the interface described in section 3. The software of our architecture runs on the 
router controller. Metering is also performed on the router controller because the 
current version of the GR2000 has no flow-based metering capabilities. Our testbed is 
shown in Fig. 12. 

In the following we describe a scenario of DiffServ service creation, deployment, 
enforcement and measurement which we have realized within the testbed. The 
scenario comprises the following features: 

• application of active concealment for voice flows when the packet loss rate is low 
(a threshold of 10% is used in our current testbed) 

• creation and deployment of state to DiffServ core routers realizing a Per-Domain 
Behaviour (PDB) 

• creation and deployment of per-flow state in DiffServ edge routers to map a flow 
to a DiffServ Code Point (DSCP) 

• flow remarking and enforcement of the DiffServ Per-Hop Behavior (PHB) using 
the automated router configuration and measurement (active meter) 

When foreground traffic (in this case, voice traffic) is transmitted, the 
measurement system at the core router detects the foreground flow. However, it is not 
able to decode the upper layer protocol (RTP). Therefore the corresponding meter 
module is requested and uploaded from the network management system. Then 
measurement data on the flow is collected. When a Per-Domain Behaviour (PDB) is 
created at the management station, it can be deployed by having mobile agents travel 
to core routers to perform the necessary configuration for the respective PHBs. When 
the packet loss rate is lower than 10%, no reservation of bandwidth is necessary and 
active concealment is applied to regenerate lost packets as described in section 3.4. 
When congestion is about to occur, it can be detected since the packet loss rate and 
routers’ queue length (among other values) are measured at the routers and 
periodically conveyed to the network management system. Collected information at 
the management system can help to detect the location of congestion early and take 
the most appropriate measures. Upon detection of congestion, mobile agents can carry 
a filter to edge routers to remark the foreground traffic flow to a certain DSCP. 
According to the deployed PHB, the core router now gives preference to packets 
carrying this DSCP. Another possibility is to have mobile agents only configure the 
routers at the congestion location to serve foreground traffic with higher priority. 
Thus, partial QoS enforcement can be combined with active concealment. The voice 
quality now remains high independent of the congestion situation in the network. 
Note that it is possible to apply the filter (as well as the core routers’ configuration) 
only temporarily in our architecture. This allows us to deploy an agent that can 
configure the router and become inactive for a specified time interval. After that the 
agent becomes active again, deletes the filter, and terminates or moves to another 
network node. 
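The policy in this scenario (conceal below the loss threshold, remark at the edge otherwise, and remove the filter again afterwards) can be sketched as follows. Everything in the sketch is hypothetical: `EdgeRouter` is a stand-in for a router's filter table rather than the GR2000 interface, the function names are invented, and DSCP 46 (Expedited Forwarding) is only an example code point. The body of the `with` statement stands for the interval during which the agent is inactive.

```python
from contextlib import contextmanager
from typing import Dict, Iterator

LOSS_THRESHOLD = 0.10  # threshold used in the testbed scenario
EF_DSCP = 46           # illustrative code point (Expedited Forwarding)


def choose_action(loss_rate: float) -> str:
    """Below the threshold concealment suffices; otherwise remark the flow."""
    return "conceal" if loss_rate < LOSS_THRESHOLD else "remark"


class EdgeRouter:
    """Hypothetical stand-in for an edge router's remarking filters."""

    def __init__(self) -> None:
        self.filters: Dict[str, int] = {}  # flow identifier -> DSCP


@contextmanager
def temporary_filter(router: EdgeRouter, flow_id: str, dscp: int) -> Iterator[None]:
    """Install a remarking filter and always delete it again on exit."""
    router.filters[flow_id] = dscp
    try:
        yield
    finally:
        router.filters.pop(flow_id, None)


router = EdgeRouter()
print(choose_action(0.04))
if choose_action(0.15) == "remark":
    with temporary_filter(router, "voice-1", EF_DSCP):
        print(router.filters)  # filter is active while the agent sleeps
print(router.filters)          # filter has been removed again
```

Tying the removal to the exit of a scope is one way to guarantee that a temporary configuration does not outlive the agent that installed it.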

The above scenario of automatic QoS deployment demonstrates the importance of 
the decentralised control of routers. This is achieved by having mobile agents move 
from hop to hop to configure the routers. The decentralised control of core routers 
allows for simple autonomous PDB deployment. The decentralised control of edge 
routers allows for simple autonomous enforcement of temporary Service Level 
Specifications (SLSs). Due to the agent-based approach, the system has a high fault 
tolerance in case links from the central network management system to nodes are 
temporarily down or congested. Furthermore, the scenario shows how our 
architecture’s open interfaces can be used to set up a QoS configuration in an active 
network. Our DiffServ scenario demonstrates the main benefit of active 
network technology: the autonomous time-dependent management of QoS by active 
components. 



[Figure: network management system and active network] 

Fig. 12. Testbed Configuration. 



6 Conclusion and Future Work 

In this paper we presented a novel three-level active node architecture which consists 
of a fixed, a programmable and an active part. Our active node architecture achieves 
both the flexibility and the high performance that are the core requirements for a 
dynamic network infrastructure. Our prototype implementation is based on Hitachi 
GR2000 high-speed routers. However, the flexibility of our network node architecture enables 
easy plug-in of other high-speed routers. The programmable part currently consists of 
a DiffServ and a meter module. The QoS module for DiffServ as part of our three-level 
active node architecture allows for interoperability with standard QoS control 
mechanisms, which is essential for the acceptance of the new active network 
technology. The standard interfaces allow the control of networks by traditional QoS 
control as well as by components dynamically deployed by active networks. We have 
specified and implemented a Differentiated Services module that abstracts from a 
particular network device and uses DiffServ semantics to program the device. 
The meter module uses native code modules that can be dynamically loaded to extend 
the meter functionality. This allows for high flexibility and extensibility as well as 
performance. A generic ruleset language is used for specifying meter rules and 
corresponding modules independently of the architecture of the network elements. 
Mobile agents can be used to transport rulesets to the meter devices or to 
read meter data and transport it to other systems. Other plug-in modules are 
being designed and implemented. 

Another plug-in module for reactive concealment of voice streams is currently 
being implemented. With this module we design a new active network application for 
voice over IP that exploits the flexibility of active networks to perform 
application-specific packet processing. Simulation results have demonstrated that significant 
speech quality improvements can be achieved compared to pure end-to-end 
application-level algorithms. An unoptimized software implementation of the active 
loss concealment reconstructs a lost packet with an average execution time overhead 
of 220 µs on a PC with a Pentium III 500 MHz CPU and 128 Mbytes RAM. Since the 
active node only performs packet regeneration for a small portion of packets of voice 
streams, the average consumption of node resources is reasonably low. The more 
complex encoding can be done at the sending end system that usually has sufficient 
processing power because the bandwidth overhead due to encoding is very low. 
Together with active metering this approach allows for automatic QoS improvement 
for voice transmission in an active network environment. 

Additional future work includes the investigation of active network applications 
where a number of active nodes can be placed along the data path to download and 
perform the active loss concealment algorithm. It would also be interesting to 
investigate how well and how many times active loss concealment 
can be performed recursively. Furthermore, since both application-level 
Forward Error Correction and application-specific packet processing incur additional 
consumption of network resources, we plan to compare these two approaches. The 
result of this comparison might enable an optimal combination of the two approaches 
to obtain further improvement of speech quality. A further step is to implement plug-in 
modules for audio compression and Forward Error Correction. An integration of 
these modules in the existing programmable level allows significant flexibility and 
efficient coordination with other implemented modules. 



Abstract. Active Networking (AN) involves the processing of programs 
in heterogeneous networking environments. There are several AN solutions, 
exposing different APIs and using different languages, and each 
may be appropriate for different tasks such as high-speed multimedia 
processing or low-speed routing adjustments. 

We describe our active node system, LANode, that separates control- and 
data-plane activities, and introduce Component Compatibility Markup 
Language (CCML), a critical component of LANode that allows it to be 
applied to heterogeneous platforms. 



1 Introduction 

Active Networking (AN) is an approach to providing new network services without 
major changes to network infrastructure or formal standardisation. Instead, 
code is dynamically installed into network nodes to replace or augment their 
basic function of routing or switching [?]. Areas of AN research include 
code-installation methods, portability and security versus functionality in the choice 
of language or programming environment, and packet delivery (virtual networks 
versus packet interception). 

At Lancaster, we are developing LANode, an active platform which can be 
adapted for, and deployed on, heterogeneous hardware, operating systems and 
AN environments. LANode can provide node-specific services, and active ap- 
plications on LANode adapt themselves to these services to expose their own 
node-independent services. We deal with node heterogeneity using a form of 
content negotiation involving an XML document format, CCML (Component 
Compatibility Markup Language). 

We report on recent AN technologies in Sect. 2. LANode and CCML are 
detailed in Sects. 3 and 4. Our planned developments and research activities are 
described in Sect. 5. 

2 Active Networking 

The aim of active networking is to allow new services to be dynamically deployed 
in a network without changing the core functionality of the network components 
or consulting standardisation bodies. Programs to support the services are loaded 
into the network nodes to process traffic intended for them, with varying levels 
of dynamism, and of restriction by the network operators. 

2.1 AN Characteristics 

Several AN architectures and environments have been proposed and developed. 

We now describe some of the various dimensions of these systems. 

Program installation. In the discrete approach, programs are loaded onto 
nodes that particular traffic is anticipated to pass through. In the in-band 
(or capsule-based) approach, programs exist (or are referred to) within the 
traffic, and loaded as required. 

Planes. Network services may process traffic in various ways. Some may process 
large volumes to control network load (e.g. as filtering does) or to service 
heterogeneous clients (e.g. transcoding) — we describe these as data-plane 
activities. Other services may process control information that subsequently 
affects larger volumes of traffic (e.g. routing), or may gather the information 
about such traffic — these are control-plane activities. 

Different programming environments are suited to different activities. A 
machine-code, zero-copy, kernel-space environment is more appropriate for 
high-speed, data-plane activities, whereas high-level languages with extensive 
data abstractions and libraries are better for sophisticated control-plane 
activities. 

Integrity. Safety and security are vital if a network’s hardware is to be exposed 
to programs performing unforeseen activities. These may be achieved 
through use of restricted programming languages, access-controlled libraries 
(the sand-box model), and formally verified programs. 

Traffic capture. AN environments must deliver traffic to their installed programs 
by some means. Traditionally, packets are generated within the active 
network, to be passed across a virtual network overlaying a conventional 
network, but packets may also be intercepted as they traverse a node using 
conventional protocols. In the latter case, the environment must provide 
facilities for specifying which packets to intercept (e.g. Berkeley Packet Filtering 
(BPF) [?]), and due to the potential overhead of having large numbers 
of complex packet filters, approaches have been sought to perform efficient 
multi-field classification [?]. 



2.2 AN Work 

Much of the early work on AN was carried out as part of an American Department 
of Defense (DARPA) initiative to develop network technologies that 
were highly adaptable and robust. Some of the results from this early work are 
summarised in [?]. An overview of some of the significant AN platform research 
follows. 



Deployment and Experimentation. The ABONE is a shared virtual network 
of nodes for conducting large-scale AN research [?]. Independent developers 
can provide their own execution environments (EEs) within which active 
applications (AAs) can run. EEs are manually deployed across ABONE nodes using 
a common management interface on each node, and AAs are loaded into an EE 
according to its specification. Packets traversing the ABONE are encapsulated 
using ANEP, which specifies the intended EE type by a global numeric identifier, 
so different EE developers can run experiments independently. 
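The role of the numeric EE type identifier can be pictured as a demultiplexing table on each node. The sketch below is purely illustrative: it does not model the ANEP wire format, and the class name, method names and the identifiers 1001 and 1002 are invented for the example.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[bytes], str]


class ActiveNode:
    """Hands each packet to the execution environment named by its type identifier."""

    def __init__(self) -> None:
        self.environments: Dict[int, Handler] = {}

    def deploy(self, type_id: int, handler: Handler) -> None:
        """Register an EE under its (hypothetical) numeric type identifier."""
        self.environments[type_id] = handler

    def deliver(self, type_id: int, payload: bytes) -> Optional[str]:
        handler = self.environments.get(type_id)
        if handler is None:
            return None  # no EE of this type is deployed on this node
        return handler(payload)


node = ActiveNode()
node.deploy(1001, lambda payload: f"EE-A handled {len(payload)} bytes")
node.deploy(1002, lambda payload: f"EE-B handled {len(payload)} bytes")
print(node.deliver(1001, b"hello"))
print(node.deliver(9999, b"hello"))
```

Because each EE only sees packets carrying its own identifier, experiments by different developers do not interfere with one another on a shared node.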

Execution Environments. The ANTS toolkit [?], developed at the Massachusetts 
Institute of Technology, is one of the earliest approaches to providing 
active technologies in the network. It adopts the in-band or capsule approach to 
program installation, where a packet includes a reference to a forwarding routine, 
which is used to process the packet at each ANTS node. Java is the programming 
language in which capsules are programmed. This enables the capsules to 
be executed in a safe sand-boxed environment. 

The SwitchWare architecture [2] is another of the early active architectures
to emerge from DARPA-funded research. SwitchWare packets contain both code,
in the form of PLAN (Programming Language for Active Network) statements,
and data. In addition to active code embedded in packets, active extensions
(commonly used code resident on a node) enable a greater level of functionality
and can be referenced from PLAN. Security in SwitchWare is provided using a
combination of both cryptographic based authentication and formally verifiable
programming languages.

Active/Programmable Packet Processing. Unless active packets are di- 
rectly addressed to the next active node, nodes must provide mechanisms for 
efficient packet capture and processing. 

LARA++ pni develops work carried out on LARA (Lancaster Active Router
Architecture) jS]. A component-based architecture has been adopted, enabling a
greater degree of flexibility than first-generation active node architectures. In
addition, active components execute in a zero-copy, user-space environment
with a minimal performance penalty. LARA++ aims to make the development
of active programs easier because of these properties, without a significant
performance hit.

The Router Plugins jj] active router architecture is oriented around the con- 
cept of a plugin — the base unit of modularity in the architecture that performs 
some form of meaningful computation. Plugins can be dynamically loaded at 
run time and bound to specific flows to perform computation on those flows. 

Pronto m is intended to be a platform on which high level AN research 
can be trivially conducted. This is due to the fact that Pronto is based upon 
and extends the functionality of a commodity OS (Linux). Pronto is intended to 
support multiple heterogeneous execution environments. In a similar manner to 
LANode (see Sect. 3), attention has been paid to the identification and separation
of the service specific and service generic facilities of the platform and the
interfaces between them. 

Application-Level Active Networking. The University of Technology, Sydney
has taken an alternative approach to active networking by moving it into
the application level. They introduce the general concept of Application-Level
Active Networking (ALAN), which involves dynamic installation of proxy ser- 
vices on suitable nodes. Active capsules come in the form of proxylets which 
can execute in an Execution Environment for Proxylets (EEP; historically also 
Dynamic Proxylet Server, DPS). 

UTS have defined a Java API that distinguishes between an initialisation 
phase and a daemon phase of the proxylet’s execution. They have also built 
an EEP implementation in Java called funnelWeb, and have constructed a vir- 
tual network of these for testing and evaluation of funnelWeb, and of various 
distributed algorithms running across proxylets P|. The use of Java ensures 
portability, and provides the security mechanism. 

funnelWeb has been used to provide multicast bridges to connect MBONE
islands, TCP bridges to route around poor international connections, and
HTTP-to-RTP gateways to extend the functionality of web servers.

Current work with the EEP is focused on providing application-level routing 
so that proxylets can co-operate by building meshes of communication according 
to an application’s criteria pm. 

3 LANode 

ALAN consists of proxylets performing application-dependent activities in a 
Java environment. Applications must be explicitly configured to use them, such 
as a browser being configured to use an HTTP proxy. Also, the use of Java for
programming proxylets ensures widespread compatibility, but can be slow for 
high-volume data-plane activities, particularly if packets are copied into user 
space. 

We have chosen to extend the ALAN/proxylet approach in such a way
that the advantages of portability and security can be retained, while the use 
of machine-optimised code and kernel-space packet processing are introduced. 
LANode allows proxylets to perform data-plane tasks efficiently, and without 
being directly addressed by the applications using them. 

3.1 Architecture 

LANode is a development of the ALAN EEP, and consists of two planes (as in
Fig. 1):

Control plane. This offers a conventional EEP environment to run Java prox- 
ylets that provide mainly control and management functions. 

Data plane. This may offer various execution environments, including ones 
specific to processor or operating system type. Optimised for high-perfor- 
mance packet processing, most network traffic is expected to traverse this 
plane, and avoid the control plane. 

This is a common distinction found in operating systems that have both a
general use and an IP forwarding function: packets traversing the node never
leave the kernel (the data plane), but utilities can modify the routing table from
user space (the control plane). Processing in the control and data planes runs
independently, except for occasional interaction between them
when the data plane is reconfigured, or the control plane is informed of activity.

Fig. 1. LANode EEP Control and Data Planes.

The EEP API is extended to provide access to the node’s resources through 
a set of named services called profiles. This set can vary from one node to the 
next, according to what it can best provide - we do not prescribe a single, broad, 
fixed interface. 

Some profiles provide administrative services such as a CORBA naming service
or an RMI registry so that proxylets can present their own named services.
Other profiles expose the data plane to configuration with varying flexibility. For 
example, exposing the node’s IP routing table allows some control over traffic. 
Greater flexibility can be obtained through a profile that allows active code to 
be installed in the data plane, e.g. a node on a Linux system may have a profile 
that allows the installation of kernel modules. Once in place, the two parts can 
interact through the profile, usually to exchange configuration information. 

Each profile has a name, a Java class/interface type, and documented be- 
haviour, ensuring that if a profile is provided on two different nodes, one can 
be certain that they behave identically. However, it is not necessary for every 
node to support every profile. The total set of profiles can be expanded as new 
services are foreseen (though clearly, a minimal set is desirable), and the naming 
scheme is hierarchical so that organisations can develop profiles independently. 
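
The profile model described above can be illustrated with a minimal Java sketch of a per-node table that maps hierarchical profile names to typed implementations. The names `ProfileRegistrySketch`, `provide`, `lookup` and `RoutingTableProfile` are hypothetical and chosen only for this illustration; they are not the LANode API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical profile interface, for illustration only. */
interface RoutingTableProfile {
    void addRoute(String destination, String gateway);
}

/** Hypothetical sketch of a per-node table of named profiles. */
final class ProfileRegistrySketch {
    private final Map<String, Object> profiles = new LinkedHashMap<>();

    /** Registers an implementation under a hierarchical profile name. */
    <T> void provide(String name, Class<T> type, T implementation) {
        profiles.put(name, type.cast(implementation));
    }

    /** Returns the profile only if this node offers it with the expected type. */
    <T> Optional<T> lookup(String name, Class<T> type) {
        Object found = profiles.get(name);
        return type.isInstance(found) ? Optional.of(type.cast(found)) : Optional.empty();
    }

    /** Names of the profiles available on this node. */
    Set<String> available() {
        return profiles.keySet();
    }
}
```

A proxylet written against such a table would treat an empty result as "profile not supported on this node", which matches the point that not every node has to support every profile.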

A particular node will be configured with properties describing the charac- 
teristics of the node, including a list of the available profiles. Other properties 
may indicate the operating system or the processor type. For example, a node 
may specify that its data-plane environment is for Linux kernel modules consisting
of i386 code with netfilter¹ support, and that there is a profile
to manipulate the IP routing table. This information allows a proxylet to be
optimised at install time for the node’s characteristics, including the type of 
data-plane environment provided. 

A proxylet needs to ensure that the code it loads into the data plane is com- 
patible with the environment there. One approach would be to allow a loaded 
proxylet to interrogate LANode to determine its capabilities to which the prox- 
ylet adapts. Instead, LANode adapts the proxylet before loading, through choice 
of components and installation parameters. We have chosen to allow the proxylet 
to be composed of several network-available components at installation time. The 
proxylet can be expressed as a CCML document (see Sect. 4) that lists all the
possible components, and indicates the LANode characteristics (such as profile, 
processor type, operating system) that each component is compatible with, and 
it is this document that is submitted to the node to install a proxylet. The node 
compares its characteristics with the document, and selects a set of components 
from which to build the proxylet.

This selection allows, for example, one compilation of a Linux kernel module 
to be selected from compilations for several processor types. Independently, part 
of a proxylet may be an RMI object or a CORBA object, depending on whether
an RMI registry or a CORBA name service is locally available.

3.2 Implementation 

LANode is a combination of two parts: 

— The core provides the extended EEP environment, along with the basic
facilities for loading, starting and stopping a proxylet through a CORBA
interface.

— The stub provides the profile implementations, and is supplied to the core 
as run-time configuration. 

This approach separates the development of the EEP and the profiles. For 
example, our current core spawns a virtual machine (VM) for each proxylet, 
but a future version may manage all of them in a single VM. This choice is 
independent of the provision of profiles and other node characteristics, so the 
same stub should be compatible with both cores. 

We have built a stub designed for Linux, providing profiles to allow: 

— access to an RMI-like CORBA registry for exposing services provided by
proxylets, 

— access to the node’s IPv4 routing table, 

— modules to be loaded into the kernel, and contacted through a virtual device, 

— shared libraries to be linked to the virtual machine to support native
methods, and

— executable programs to be installed and used as required.

¹ The netfilters kernel extension allows convenient packet interception in later Linux
kernel versions.

Fig. 2. Developed Proxylets.


The Java security mechanism is employed to check the validity of any at- 
tempts to use these profiles. For example, loading modules, libraries or exe- 
cutable programs requires a NativePermission that specifies where the code 
can be loaded from, and for what purpose. 
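
As an illustration of how such a check could be expressed with the standard Java permission classes, the sketch below defines a permission carrying a code-source prefix and a set of purposes. The class name `NativePermissionSketch`, the purpose strings and the prefix-matching rule are assumptions made for this example; the text above does not specify how LANode's NativePermission is implemented.

```java
import java.security.Permission;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical permission: where native code may come from, and for what purpose. */
final class NativePermissionSketch extends Permission {
    private final TreeSet<String> purposes;

    /** @param codeSourcePrefix URL prefix that loaded code must start with
     *  @param purposes comma-separated purposes, e.g. "module,library" */
    NativePermissionSketch(String codeSourcePrefix, String purposes) {
        super(codeSourcePrefix);
        this.purposes = new TreeSet<>(Arrays.asList(purposes.split(",")));
    }

    /** A grant implies a request whose source lies under the prefix and
     *  whose purposes are all covered by the grant. */
    @Override
    public boolean implies(Permission other) {
        if (!(other instanceof NativePermissionSketch)) return false;
        NativePermissionSketch that = (NativePermissionSketch) other;
        return that.getName().startsWith(getName()) && purposes.containsAll(that.purposes);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NativePermissionSketch
            && getName().equals(((NativePermissionSketch) o).getName())
            && purposes.equals(((NativePermissionSketch) o).purposes);
    }

    @Override
    public int hashCode() {
        return getName().hashCode() * 31 + purposes.hashCode();
    }

    @Override
    public String getActions() {
        return String.join(",", purposes);
    }

    /** Exposed for inspection in examples. */
    Set<String> purposes() {
        return purposes;
    }
}
```

Under these assumptions, a grant for a given URL prefix with purposes "module,library" would cover a request to load a kernel module from a JAR under that prefix, but not a request to run an executable program.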



3.3 Applications 

By exposing the data plane to LANode proxylets, we aim to provide proxylets 
optimised for various platforms providing high-performance packet processing, 
while exposing interfaces (CORBA/RMI) to sophisticated control objects to ad- 
just that processing. The same proxylet can be loaded onto several heterogeneous 
nodes, and perform the same task efficiently. 

Proxylets developed so far (depicted in Fig. 2) include:

— an IPv4 routing proxylet, exposing a routing table through a CORBA inter- 
face, 

— a traffic detection proxylet, exposing a CORBA interface to register an in- 
terest in packets traversing a node, 

— a packet diversion proxylet to allow packets belonging to particular flows to
be intercepted and processed in the control plane²,

— a transparent proxying proxylet (depicted in Fig. EJ which can intercept a 
TCP connection to a given host and port, and redirect it to an alternative 
host and port. 



² This is intended for applications where the amount of traffic to be processed is quite
small.

4 CCML

LANode is intended to be deployed on heterogeneous nodes, and must be pre- 
pared to run active code on any type of system, and it is assumed that a partic- 
ular active application has been generated several times, once for each system 
type. (LANode permits a proxylet to consist of several parts with varying degrees 
of system dependence, so the generic parts only need to be generated once.) So 
LANode requires a mechanism for selecting one of these versions which is com- 
patible with its node’s characteristics. This function is analogous to content 
negotiation, as used in the WWW, and supported through HTTP 0. 

There are two main approaches to content negotiation: agent-driven and 
server-driven. 

agent-driven. An ordinary request is met with a response listing several
alternatives and their characteristics. The requesting agent must choose items
that are compatible with its own characteristics, and make further requests.

server-driven. The request is augmented with the agent’s own requirements,
and the server chooses an appropriate response, and returns it directly.

The latter involves only a single interaction, but is more difficult to cache 
in proxies. The former works better with caches, but requires at least one extra 
interaction. Whether this is significant in the case of LANode (which has the 
role of the agent) is an issue for further study. 

Component Compatibility Markup Language (CCML) is an XML ^ format 
to represent responses during agent-driven content negotiation. Instead of the 
agent (e.g. LANode) specifying its own characteristics in a request for an entity 
(e.g. a proxylet), it requests the entity’s CCML document. The agent should 
be able to resolve the document against its own characteristics, which produces 
information about a version of the entity that is tailored toward the agent. It 
is not limited to resolving to a single reference — the resultant components are 
expected to be (re-)composed by the agent to form the resource. 

An agent can combine a CCML document with its own ‘target characteristics’
(name-value strings), as in Fig. 3, to produce a set of compatible components,
plus some installation properties (name-value strings). Additionally, each
component may have an attached parameter (just a string).



4.1 Related Work 

Component selection or content negotiation also exists in other languages. SMIL
is intended for media components to produce multimedia presentations by
specifying their relative spatial and temporal positions P|. It also allows alternative
media to be presented based on client characteristics such as preferred natural
language, or bandwidth available to the client. This function, provided by SMIL’s
switch element type, is equivalent to CCML’s function. However, the names of
testable characteristics are expressed directly as attribute names, rather than
attribute values, so adding new characteristics (or replacing them entirely) for
an alternative application requires the DTD to be abandoned. Also, the
mathematical relationship (‘equals’, ‘greater than’, ...) between a characteristic and a
given value is implied by the name, whereas CCML allows this to be expressed
as an attribute value, in order to be more independent of its application.

Fig. 3. CCML Function.
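
To make the distinction concrete, the following Java sketch selects the comparison by a value supplied at run time rather than by a fixed name built into the schema. The relation names (`EQUALS`, `GREATER_THAN`, `LESS_THAN`), the integer-valued characteristics and the `RelationSketch` type are assumptions for illustration; the actual attribute vocabulary of CCML is defined by its DTD, which is not reproduced here.

```java
import java.util.function.BiPredicate;

/** Hypothetical set of relations that a data-driven condition could name. */
enum RelationSketch {
    EQUALS((a, b) -> a.compareTo(b) == 0),
    GREATER_THAN((a, b) -> a.compareTo(b) > 0),
    LESS_THAN((a, b) -> a.compareTo(b) < 0);

    private final BiPredicate<Integer, Integer> test;

    RelationSketch(BiPredicate<Integer, Integer> test) {
        this.test = test;
    }

    /**
     * Evaluates "characteristic <relation> given", where the relation is
     * looked up by name, as it would be if read from an attribute value.
     * An unknown relation name raises IllegalArgumentException.
     */
    static boolean evaluate(String relation, int characteristic, int given) {
        return valueOf(relation).test.test(characteristic, given);
    }
}
```

With this shape, supporting a new characteristic needs no change to the evaluator: only the name and value passed in differ.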

For active networks, XML has already been applied to specifying the seman- 
tics and non-functional characteristics of distributed objects, so that collabora- 
tive applications can be built with predictable behavioural characteristics m- 

4.2 Syntax and Semantics 

The syntax of CCML is defined by the XML DTD at jOj. A CCML document 
consists of a sequence of ‘actions’, many of which may be conditional. Conditions 
are expressed through ‘expression’ elements, with comparisons against platform 
characteristics forming the primitive expressions, and logical operators forming 
compound expressions. 

Processing a CCML document begins with initialising state: an empty list 
of tagged component references, and an empty set of properties (to become 
the ‘installation parameters’ in Fig. 3). Each action element is then processed
in sequence, unless an enclosing element somehow precludes it. Processing an 
action may result in: 

— an item in the installation parameters being set, reset or overwritten, 

— a component reference being added to the list, with or without a tag, 

— a previously added reference being tagged. 

Expression elements consist of: 

— REL, representing boolean relationships between configuration properties and 
supplied values, 

— LOGIC, representing multi-operand logical expressions, and containing those 
operands, 

— EXT, referring to other expressions in the current document or others,

while action elements consist of:

— LOAD, which specifies the (relative) URI of a component to be added to the
component list, 

— STORE, which associates a tag with a component URI, 

— SET, which sets the value of a global property or installation parameter, 

— BRANCH, which contains some expressions, followed by actions to be
performed only if the expressions are all true (the branch is then said to be
compatible),

— DECIDE, which contains only branches, of which only the first compatible
one is selected.

Expressions may also appear outside BRANCH elements, where only actions 
are normally expected. These are ignored except when referenced by an EXT 
element. 

It is the DECIDE element type that allows alternative components to be se- 
lected. The actions within only one of its branches are processed, and if no 
branches are compatible, the processing is aborted. 
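
The processing rules above can be sketched in Java over a small in-memory model. This is a minimal sketch under stated assumptions, not the authors' processor: it covers only LOAD, SET, BRANCH (inside DECIDE) and DECIDE, omits STORE, LOGIC and EXT, and assumes that a REL condition succeeds when its argument appears in a comma-separated property value. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Thrown when a DECIDE has no compatible branch, aborting processing. */
final class CcmlAbort extends RuntimeException {
    CcmlAbort(String message) { super(message); }
}

/** Processing state: the component list plus the installation parameters. */
final class CcmlState {
    final List<String> components = new ArrayList<>();
    final Properties parameters = new Properties();
}

/** An action element; applying it may change the state. */
interface CcmlAction {
    void apply(Properties node, CcmlState state);
}

/** Hypothetical in-memory model of a few CCML elements. */
final class CcmlSketch {
    /** REL: assumed here to test membership in a comma-separated property. */
    static boolean rel(Properties node, String prop, String arg) {
        for (String item : node.getProperty(prop, "").split(",")) {
            if (item.trim().equals(arg)) return true;
        }
        return false;
    }

    /** LOAD: append a component URI to the list. */
    static CcmlAction load(String uri) {
        return (node, state) -> state.components.add(uri);
    }

    /** SET: set or overwrite an installation parameter. */
    static CcmlAction set(String name, String value) {
        return (node, state) -> state.parameters.setProperty(name, value);
    }

    /** BRANCH: conditions as {property, argument} pairs, then nested actions. */
    static final class Branch {
        final String[][] conditions;
        final List<CcmlAction> actions;

        Branch(String[][] conditions, CcmlAction... actions) {
            this.conditions = conditions;
            this.actions = Arrays.asList(actions);
        }

        /** A branch is compatible when all of its conditions hold. */
        boolean compatible(Properties node) {
            for (String[] c : conditions) {
                if (!rel(node, c[0], c[1])) return false;
            }
            return true;
        }
    }

    /** DECIDE: process only the first compatible branch; abort if none. */
    static CcmlAction decide(Branch... branches) {
        return (node, state) -> {
            for (Branch b : branches) {
                if (b.compatible(node)) {
                    for (CcmlAction a : b.actions) a.apply(node, state);
                    return;
                }
            }
            throw new CcmlAbort("no compatible branch");
        };
    }

    /** Starts from empty state and processes each action in sequence. */
    static CcmlState process(Properties node, CcmlAction... document) {
        CcmlState state = new CcmlState();
        for (CcmlAction a : document) a.apply(node, state);
        return state;
    }
}
```

Because a nested DECIDE is itself an action, an abort inside an inner DECIDE propagates outward and aborts the whole document, which is the behaviour described for the case where no branch is compatible.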



4.3 LANode Application of CCML 

LANode contains a CCML processor, and accepts proxylets either as single JAR 
files, or as CCML documents. In the latter case, the document is processed ac- 
cording to the LANode’s properties, and the resultant list of components is used 
to form the proxylet’s class path (the tags are ignored). Furthermore, the derived 
installation parameters are made available to the proxylet as its configuration. 
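
One way to picture the step from component list to class path is the Java sketch below, which resolves each (relative) component URI against the location of the CCML document and hands the result to a standard `URLClassLoader`. The class `ProxyletClassPathSketch`, its method names and the example.org URLs used with it are hypothetical; the text does not say how LANode builds its class loader.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

/** Hypothetical helper turning selected component URIs into a class path. */
final class ProxyletClassPathSketch {
    /** Resolves each component URI relative to the CCML document's location. */
    static URL[] resolve(URI documentLocation, List<String> componentUris) {
        URL[] urls = new URL[componentUris.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = documentLocation.resolve(componentUris.get(i)).toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("bad component URI: " + componentUris.get(i), e);
            }
        }
        return urls;
    }

    /** Builds a class loader over the resolved components (tags are not needed here). */
    static URLClassLoader classLoaderFor(URI documentLocation, List<String> componentUris) {
        return new URLClassLoader(resolve(documentLocation, componentUris));
    }
}
```

Resolving against the document's own location means a CCML document and its JAR files can be moved together without editing the LOAD elements.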

If the CCML document cannot be fully processed, e.g. because there are no 
compatible combinations of the proxylet components, the installation fails. 



LANode Selection Criteria. To illustrate the use of CCML in LANode, we 
describe some typical properties that a LANode node may provide. 

profile lists profiles supported by the node. 

arch lists processor types supported in the data plane. 

These are not yet fixed, since it is not clear whether a node should be per- 
mitted to provide several similar variants of a service, particularly when there 
is more than one dimension of variance (e.g. loading a Linux kernel module on 
two processor types, and with two different sets of kernel extensions). This is a 
topic of further study. 

Depending on the values of the properties above, other criteria may be 
present, possibly refining their meaning. For example: 

linux.kernel-environment lists extensions to the Linux kernel, such as
netfilters.

LANode Profiles. As with the selection criteria, these allocations are yet to
be fixed. In particular, the set of profile types is expected to grow as new services 
with nodes are defined and exploited. 

nativecode.CharDeviceLoader permits loading of data-plane modules that 
provide a virtual character device, accessible from user space, for control 
purposes. 

nativecode.ProgramLoader permits loading of executable programs which
can be accessed in Java through a Process object.

registry.Registry permits access to a registry of local CORBA objects which
proxylets have provided.

Profile names are closely associated with the names of their interface types, 
e.g. CharDeviceLoader has a Java interface type of CharDeviceLoader. While 
this association is desirable, it is not a requirement. 



Adaptable Proxylets. Given the above LANode properties and profiles, we 
can construct a proxylet suitable for a wide variety of platform types by sepa- 
rating its functionality according to level of platform dependence. 

For the proxylet that performs transparent proxying, we can isolate several 
platform-independent parts - external control-interface CORBA stubs, a veneer 
interface (the veneer will adapt the node’s profiles to the transparent proxy- 
ing service), and a main function to bind them together - and specify them 
unconditionally in a CCML document with several <LOAD> elements.

Then we can produce several veneer implementation classes, one for each 
useful profile (e.g. the Linux kernel module loader) — LANode only needs to 
select one of these, so they are listed as alternative <BRANCH>es in a <DECIDE>.
We can develop new classes as new profiles are defined. As well as identifying
the appropriate JAR file with <LOAD>, CCML can be used to pass the name of
the class within it that should be used as the implementation. 

For each of these implementations, there may be further dependencies — the 
module loader may be supplied with a module chosen from several precompiled 
alternatives for different processor types. 

This branching and refinement at each level is depicted in Fig. 4.



4.4 Example 

The current implementation of our Linux-based LANode stub together with the 
transparent-proxying proxylet serve as an example use of CCML. Firstly, the 
proxylet is broken into several components as JAR files: 

— iptpidl.jar — CORBA IDL stubs for the control interface,

— iptpproxylet.jar — core proxylet implementation and CORBA control
implementation,

— iptpv.jar — veneer interface to control transparent proxying,

— iptpcdev.jar — veneer implementation based on a loadable virtual
character device,

— iptp-linux-i386.jar — transparent-proxying module for a Linux kernel,
compiled for i386 processors.

Fig. 4. Graphical Structure of CCML.

[Figure labels: CORBA control interface; CORBA stubs/main; veneer interface; veneer implementation; profile; virtual device; kernel module; netfilters]

Fig. 5. Interactions between Proxylet Components in LANode/Linux.

These will form the running proxylet, and interact as in Fig. 5. Control
information passes from the CORBA interface, through the veneer, the profile and
the virtual device, and alters the behaviour of the module and the traffic passing
through it.

The CCML document would include the fragment in Fig. 6. The first three
components are generic components, so they appear unconditionally in the
CCML as <LOAD> elements.

<!-- generic components -->
<LOAD URI="iptpproxylet.jar" />
<LOAD URI="iptpidl.jar" />
<LOAD URI="iptpv.jar" />

<DECIDE>
  <BRANCH> <!-- only for character-device data planes -->
    <REL PROP="profile"
         ARG="UK.ac.lancs.nativecode.CharDeviceLoader" />
    <SET NAME="veneerName" VALUE=
      "UK.ac.lancs.iptp.chardev.CharDeviceIPv4TransProxyFactory" />
    <LOAD URI="iptpcdev.jar" />

    <DECIDE>
      <BRANCH> <!-- only for i386 Linux data planes -->
        <REL PROP="arch" ARG="i386" />
        <REL PROP="os" ARG="linux" />
        <LOAD URI="iptp-linux-i386.jar" />
        <SET NAME="UK.ac.lancs.iptp.IPv4DeviceImpl"
             VALUE="iptp.o" />
      </BRANCH>
      <!-- further alternatives... -->
    </DECIDE>
  </BRANCH>
</DECIDE>



Fig. 6. Example CCML. 



iptpcdev.jar adapts the CharDeviceLoader profile to the veneer interface;
in practice, this means translating calls to the interface methods into character
streams to be transmitted through the virtual device. It must only be used if that
profile is available, i.e. character devices can be loaded into the data plane, so its
entry in CCML is preceded by a <REL> condition expressing this requirement,
and placed in a <BRANCH> element. (If there were other alternatives, as in a
richer scenario, they would appear as other branches in a <DECIDE> element,
which is also shown in the example.) The component is also accompanied by an
installation parameter (a <SET> element) to tell the generic parts of the proxylet
which class from iptpcdev.jar implements the veneer interface.

Having selected iptpcdev.jar, the node must also have a component to load into
the kernel. Only one is provided in this limited example, and this is the file iptp.o
in iptp-linux-i386.jar. This forms part of the proxylet only if the data plane
is a Linux kernel on an i386 processor; these conditions are placed within
a BRANCH element, and an installation parameter is included to tell the veneer
implementation where to find the module. The branch is placed within a DECIDE
element alongside the character-device veneer implementation. If other processor
types are supported by the proxylet, they will have similar branches inside the
DECIDE.

LANode with the Linux stub has the characteristics in Fig. 7. Applying these
to the CCML document produces a list of all the components listed above, plus
the installation parameters of Fig. 8.



Component Selection for Heterogeneous Active Networking 






profile=UK.ac.lancs.nativecode.LibraryLoader,
        UK.ac.lancs.nativecode.ProgramLoader,
        UK.ac.lancs.nativecode.ModuleLoader,
        UK.ac.lancs.nativecode.CharDeviceLoader,
        org.ieee.p1520.ipswg.routing.RoutingTable,
        UK.ac.lancs.registry.Registry
os=linux
arch=i686,i586,i486,i386



Fig. 7. Example LANode Selection Properties. 



veneerName=
  UK.ac.lancs.iptp.chardev.CharDeviceIPv4TransProxyFactory
UK.ac.lancs.iptp.IPv4DeviceImpl=iptp.o



Fig. 8. Example Installation Parameters. 



On a LANode with a different configuration, the document will not resolve, 
and an error will be reported. As compatible components for alternative
platforms are written, they can be added to the document, making the proxylet as
a whole compatible. 

4.5 CCML Processing 

We have constructed a Java library to perform evaluation of a CCML document 
against supplied properties, producing a list of tagged URIs and some configu- 
ration properties. 

This library forms the ‘CCML processor’ component of Fig. 3. In the code
fragment of Fig. 9, documentLocation is a reference to the CCML document,
and selectionProperties corresponds to the ‘target characteristics’. After pro- 
cessing, conf holds the ‘installation properties’, while transfers lists the com- 
ponent references and their tags. 

5 Future Work 

5.1 LANode Developments 

We intend to develop and settle the extended EEP API so that all future devel- 
opment is reduced to the definition of profiles. One such profile could present a 
generic router abstraction consisting of various components to classify, modify or 
forward packets through the data plane, and it should be possible to add further 
components dynamically. This might be based on the Click modular router HS|. 

We are investigating ways to allow efficient traversal of these components
without the overheads of matching packet fields to configured values, e.g. by
hashing on several fields (as used in 0) and determining if a given classifier is

import java.util.*;
import java.net.*;
import UK.ac.lancs.ccml.*;

// inputs
URL documentLocation = ...;
Properties selectionProperties = ...;

// outputs
Properties conf;
ComponentTransfer[] transfers;

ComponentSelector selector =
    new PropertiesComponentSelector(selectionProperties);
ComponentURLResolver resolver =
    new ComponentURLResolver(selector);

conf = new Properties();
transfers = resolver.resolveToTransfers(documentLocation, conf);



Fig. 9. Java Code for CCML Processing. 



based solely on those fields, or by employing a more generic, efficient, multi-field 
method of classification, e.g. m- 
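
The idea of hashing on several packet fields can be sketched as follows in Java: the fields a classifier inspects are combined into a single key, so that finding the handler for a flow becomes one hash-table lookup rather than a field-by-field comparison against every configured rule. The classes `FlowKeySketch` and `FlowTableSketch`, the choice of five fields and the handler names are assumptions for illustration only, and this exact-match table does not address wildcard or range rules.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical key built from the packet fields a classifier inspects. */
final class FlowKeySketch {
    final String src;
    final String dst;
    final int protocol;
    final int srcPort;
    final int dstPort;

    FlowKeySketch(String src, String dst, int protocol, int srcPort, int dstPort) {
        this.src = src;
        this.dst = dst;
        this.protocol = protocol;
        this.srcPort = srcPort;
        this.dstPort = dstPort;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FlowKeySketch)) return false;
        FlowKeySketch k = (FlowKeySketch) o;
        return src.equals(k.src) && dst.equals(k.dst) && protocol == k.protocol
            && srcPort == k.srcPort && dstPort == k.dstPort;
    }

    /** One hash over all fields replaces per-field matching at lookup time. */
    @Override
    public int hashCode() {
        return Objects.hash(src, dst, protocol, srcPort, dstPort);
    }
}

/** Hypothetical exact-match flow table mapping keys to handler names. */
final class FlowTableSketch {
    private final Map<FlowKeySketch, String> handlers = new HashMap<>();

    void bind(FlowKeySketch key, String handlerName) {
        handlers.put(key, handlerName);
    }

    /** Returns the bound handler, or a default when the flow is unknown. */
    String classify(FlowKeySketch key) {
        return handlers.getOrDefault(key, "default-forwarding");
    }
}
```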



5.2 CCML Developments 

CCML should have other applications, and in some cases, the language may have 
to be extended for a particular task. It already permits selection of components 
used to compose a proxylet, and should also be applicable to:

— general software installation, whereby versions of software packages and plu- 
gins appropriate to a system can be installed there using a single reference 
to the software, 

— layered multimedia delivery, in which, say, an image is hierarchically decom- 
posed according to various dimensions (colour, spatial resolution, cropping 
position), and presented through HTTP as a CCML document that allows 
the components to be minimally selected according to the characteristics of 
the rendition, 

— stylesheet selection, where the dependence of the suitability of a stylesheet 
for a given medium can be more richly expressed. 

We will also investigate the use of CCML as a response to a URN resolu- 
tion request. This would reduce a name registration to the storage of a static 
document which is then resolved by the client (in terms of its own properties) 
after receiving the response. In cases where the document is very large, the client
may submit some of its properties with the request so that a partially resolved
CCML document is returned. 

It may be necessary to extend CCML to cope with property types more 
closely associated with name resolution, for example, global position. These 
types could be added statically to the language, or we may devise a method 
of describing types through other documents to be processed dynamically. 

There may be problems if CCML is applied to situations where several inde- 
pendent dimensions of compatibility exist, resulting in m x n alternative com- 
ponents that must somehow be expressed as CCML elements. 

Future CCML developments will appear at our website [B|. 

6 Conclusions 

We have presented LANode, an active node that employs the discrete approach 
to active-code deployment. It recognises the distinction between control-plane 
and data-plane behaviour, and allows active code (proxylets) to span both planes. 

LANode has a small core ‘API’, but may have various extensions (profiles)
per node, depending on the existence of programmable entities in the underlying 
operating system and hardware. This allows it to be deployed across networks 
of heterogeneous platforms, and to develop as new platforms appear and others 
fall into disuse. 

LANode’s proxylets may be built from several components to be composed 
when they are deployed. Some components will be independent of a node’s avail- 
able profiles, and so will be used in all deployments of a proxylet; others will be 
profile-dependent, conditionally linked with the proxylet to allow it to adapt to 
the available profiles with each deployment. These components may be re-used 
in other proxylets that need to perform the same kind of adaptation. 

We have presented CCML, a document format which expresses how a prox- 
ylet is to be formed from potential components depending on a node’s profiles 
and other properties; wholly incompatible proxylets can be rejected before any
code is loaded. As new platforms appear, and profiles for them are defined, an
existing proxylet can be extended to make use of them (and therefore adapt
to the new platforms) by writing a new component and altering the proxylet’s
CCML document. CCML may have other uses in content negotiation.

Abstract. The main focus of active networking research so far has been
at the infrastructure level, facing the challenges of designing suitable 
node operating system structures and the study of different programming 
models. This has left exploration of the actual utility of active networks 
to rather simple applications that have yet to exploit the full potential 
of the programmable network. 

In this paper we present an application-driven study of active networks, 
identifying unique and practical applications that make full use of the 
active infrastructure. We explore a class of applications in network moni- 
toring that indicate a clear need for programmability as offered by active 
networking technology. 

We have built several monitoring applications on an active substrate that 
is synthesized from off-the-shelf components. We demonstrate the flexi- 
bility provided while showing that for certain application workloads such 
a system can efficiently operate at modern backbone network speeds. Our 
performance study also leads to design considerations for scaling up the 
infrastructure to future network speeds. 



1 Introduction 

While there is generally no doubt about the increased potential and flexibility
of active networks compared to the state of the art, adoption of this new
technology is, to date, still limited. Performance and security considerations may
have been important reasons for this; however, the lack of unique and appealing
applications is equally or possibly more important. Motivated by the so far rather
dry focus on infrastructure issues, we argue for an application-driven approach.
We identify and experiment with a class of applications in the field of network
monitoring that we believe have the key properties for making active networking
arguments.

Network monitoring is an increasingly important, yet difficult and demanding
task on modern network infrastructures. As argued in 0, the scalability of the 
stateless IP networks has been bought at the expense of observability, which has 
in turn led to the recent study of several ways of monitoring networks in order 
to support control and management functions fTTilT^ . Most routers offer built- 
in monitoring functionality, accessible using mechanisms such as SNMP m or 
NetFlow (21 . However, the domain of network monitoring has exposed fundamental
weaknesses of these static-functionality, protocol-based, parametric designs.
User needs vary and are uncertain at design time, causing the available manage- 
ment interfaces to fall short of user expectations. Additionally, tasks attractive 
only to minorities of users are usually not cost-effective to integrate into
routers. Finally, in cases such as detection and prevention of “denial-of-service”
attacks, the need for timely deployment cannot be met at the current pace of 
standardization. 

Operational experience indicates a clear demand for a dynamically extensible 
system to support network monitoring and measurement-based applications. Our 
hypothesis is that allowing these applications to share a common, open and 
programmable monitoring platform can be more cost-effective and efficient than 
employing closed purpose-specific monitoring tools. The main idea is to allow 
soft real-time processing of traffic measurement data by application modules, 
as close to the information source as possible. This is in line with management 
by delegation (MbD) models with the key difference lying in the level of 
abstraction: we argue for a traffic measurement approach, rather than using 
the already diluted SNMP-based information. With respect to deployment, it is 
important that for a significant number of applications the proposed system can 
be introduced as a local enhancement. Of course, more widespread deployment
is desirable to support distributed applications such as the traffic regulation 
mechanism described in this paper. 

Some unique characteristics of the application domain and the system we 
propose make this study both interesting and challenging. While formally
belonging to “management plane” functionality, our system needs to operate at
the tempo of packet forwarding. This means that on one hand, it can be devel- 
oped as a separate system, without requiring modification of existing routers. 
On the other, it carries some of the performance concerns associated with the 
forwarding function. However, unlike packet forwarding, the need for real-time 
processing is less strict. For example, extensive buffering can be used to address 
transient peak demands. The task is also a good target for parallelization, which, 
while being interesting as a feasibility argument, is not the focus of this paper. 

In this paper we describe a proof-of-concept implementation of simple
monitoring applications on a programmable infrastructure. These applications are:
traffic analysis, usage accounting and traffic regulation for distributed
“denial-of-service” attack detection and prevention. We also present a lightweight
active substrate, based on off-the-shelf components, designed to support
experimentation with these applications. Our initial experiments demonstrate that
such an approach to active networking is indeed feasible. The system provides
considerable flexibility given its simplicity. We believe that the
general methodology as well as the specific application domain is a promising
avenue for further research.

The remainder of this paper is organized as follows. In Section 2 we present
the applications and in Section 3 the active substrate used for building these
applications. In Section 4 we discuss experiments demonstrating the feasibility
and merits of our approach and studying performance issues. In Section 5 we
present related work and in Section 6 we conclude and discuss further work.

2 Applications 

2.1 Traffic Regulation 

The current Internet architecture offers very few protection mechanisms against 
ill-behaved traffic. Especially in recent years there has been an increase in Dis- 
tributed Denial of Service (DDoS) attacks pnEn) as well as flash crowd effects. 
A denial of service attack occurs when one or more hosts transmit large
amounts of traffic targeting a specific network node. Flash crowds occur when a
large number of users access the same server simultaneously¹, overwhelming the
available resources.

The two main problems are to detect the attack source(s), despite the fact 
that IP source addresses are spoofed by the attacker, and to respond, by confining 
or blocking traffic from the attacking sites. Both problems are a natural fit for 
an active networking approach. In fact, recent studies essentially assume
some router programmability for implementing these schemes. Network nodes
need to obtain information from the network to reconstruct the path towards 
the attacking sites. This can be done by monitoring traffic and recording the 
upstream router for each packet. Sampling techniques can be used to reduce the 
cost of monitoring. We show that this can be accomplished without modifying 
the router functionality. In response to an attack and once the attacker has been 
traced, router control commands can be initiated from our active substrate to 
block or otherwise confine the attacking traffic. 
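The sampling step described above can be sketched in C as follows. This is a minimal illustration, not the module used in the paper: it assumes that the upstream router can be identified by the link-layer source address of each Ethernet frame, uses a fixed 1-in-100 sampling interval, and all names (`traceback`, `traceback_pkt`, `SAMPLE_EVERY`) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_UPSTREAM 16    /* assumed upper bound on neighbouring routers */
#define SAMPLE_EVERY 100   /* assumed sampling interval: 1 packet in 100 */

struct upstream_stat {
    uint8_t  mac[6];       /* link-layer source address of the upstream router */
    uint64_t sampled;      /* sampled packets towards the victim seen from it */
};

struct traceback {
    uint32_t victim;       /* IPv4 destination under attack, as stored in the packet */
    uint64_t seen;         /* packets towards the victim seen so far */
    int      n;            /* number of upstream routers recorded */
    struct upstream_stat up[MAX_UPSTREAM];
};

/* Process one Ethernet frame carrying IPv4.
 * Returns 1 if the packet was sampled, 0 otherwise. */
static int traceback_pkt(struct traceback *tb, const uint8_t *frame, size_t len)
{
    uint32_t dst;
    const uint8_t *src_mac;

    if (len < 14 + 20)                     /* Ethernet + minimal IPv4 header */
        return 0;
    memcpy(&dst, frame + 14 + 16, 4);      /* IPv4 destination address */
    if (dst != tb->victim)
        return 0;
    if (tb->seen++ % SAMPLE_EVERY != 0)    /* deterministic 1-in-N sampling */
        return 0;

    src_mac = frame + 6;                   /* Ethernet source address */
    for (int i = 0; i < tb->n; i++) {
        if (memcmp(tb->up[i].mac, src_mac, 6) == 0) {
            tb->up[i].sampled++;
            return 1;
        }
    }
    if (tb->n < MAX_UPSTREAM) {            /* first sample from this neighbour */
        memcpy(tb->up[tb->n].mac, src_mac, 6);
        tb->up[tb->n].sampled = 1;
        tb->n++;
    }
    return 1;
}
```

The per-neighbour counters indicate through which upstream routers the offending traffic arrives; a larger sampling interval lowers the per-packet cost at the expense of slower detection.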

An important benefit of using active networks in implementing IP packet 
traceback and traffic rate limiting is the ability to adapt depending on the type 
of ill-behaved traffic. For example, active networks give users the flexibility of
dynamically deploying the appropriate protocol to counter a possible attack. As
new attacks are invented, mechanisms to counter them can easily be developed
and deployed.



2.2 Traffic Analysis 

While primarily a tool for Internet research, traffic analysis is also becoming im- 
portant for service providers to support functions such as traffic engineering m- 
SNMP |23 and NetFlow jj] impose an a-priori specification of the granularity of 
the traffic data in the form of standardized objects. This was also the main trig- 
ger for the development of passive measurement systems such as OC3MON jS|. 
These systems passively “listen” on a network link and dump packet headers to 
disk. Trace files can then be downloaded for off-line analysis. The flexibility lies 
in that these systems capture the full stream of data for later processing. There 
are, however, certain constraints with this approach: 



¹ This is also known as the Slashdot effect.



104 Kostas G. Anagnostakis et al. 



— Analysis of the data has to be performed post-mortem. There is no way 
to perform the measurements in real-time, unless there is “login” access to 
the measurement system. This restricts the use of the system to the actual 
infrastructure owner and trusted parties if real-time functionality has to be 
built into the system. 

— As the mismatch in growth trends between bandwidth, processing, disk speed 
and disk capacity increases, it is going to be increasingly impractical to 
maintain huge data-sets for post-processing. On-the-fly reduction of data by 
processing may be more efficient, especially if the processing task is limited, 
as is the case in the example module presented below. 

The contribution of an active networking approach is that users, depending 
on established trust relationships, can install modules on the monitoring system 
for efficiently performing real-time traffic analysis. 

We have built a simple analysis module to demonstrate the power of an ac- 
tive networking solution. The goal of our module is to observe the existence of 
packet trains. A packet train can be loosely defined as a set of consecutive pack- 
ets belonging to the same flow as observed on a network link. This is typical, 
for instance, for TCP traffic, which in its slow-start phase injects two pack- 
ets for each acknowledgment received. This kind of information is not available 
through the standard network management mechanisms. Also, a typical passive 
measurement system requires communication of all traffic to the analysis host. 
Our implementation is clearly more efficient, easy to implement, and adds min- 
imal overhead to the measurement system. The module consists of about 40 
lines of code (with an overhead of about 101 cycles per packet) and results in 
less than 400 bytes of measurement data. Other functions, such as extracting 
statistics on TCP window sizes or analyzing traffic burstiness, are easy to build. 
Similar functionality is not present in existing systems. 

2.3 Accounting 

The task of obtaining network usage statistics per accountable entity is impor- 
tant for managing an operational network. Often, this forms the basis for billing 
users for network usage. Volume-based billing schemes in use today rely on Net- 
Flow or SNMP-type accounting for calculating the billing scheme parameters, 
usually in terms of volume per network prefix. Recently, more dynamic pricing 
algorithms have been studied, especially with regard to providing differentiated 
services or controlling congestion. A representative example in this category is 
based on the ECN |22j mechanism as described in PE)- When the network 
becomes congested, routers mark packets probabilistically, with the probability 
depending on the level of congestion. Marked packets are charged a fixed amount 
of currency. Users who consume more resources during times of congestion are 
charged more than others. This scheme provides incentive for users to make 
reasonable use of network resources during congestion. 

To support ECN-based charging schemes, one must collect accounting
information per network prefix as well as marked vs. non-marked packets. This is not
supported by any of the existing router accounting mechanisms: NetFlow allows
either per-ToS or per-AS or per-network accounting tables. We will demonstrate
that the implementation of this scheme based on our lightweight active substrate
is fairly simple.

Fig. 1. Structure of the Active Substrate.

3 The Lightweight Active Management Environment 

The active substrate for the experiments in this paper is built entirely from 
off-the-shelf components. Two reasons influenced this choice: firstly, the goal of 
the paper is not to build a new active networking infrastructure and secondly, 
our claim is that an active network can and should be built in an incremental 
fashion. The overall architecture of our system is shown in Figure 1 and is
roughly equivalent to the structure of Switchware Jl] and other active networking 
prototypes. The system is built around the OpenBSD 2.8 [l] operating system. 
OpenBSD provides an attractive platform for developing secure applications 
because of the well-integrated security features and libraries (e.g. IPsec stack, 
SSL, KeyNote, etc.). Similar implementations, however, are possible with other 
operating systems or active network platforms. In the following paragraphs we 
will describe the various components of our system. 



Loader. The module loader is implemented as a system daemon which accepts 
TCP connections for loading and controlling modules. Users need to authenticate 
themselves to the active node by establishing a secure IPsec tunnel. The modules 
are implemented as native code shared objects. Upon receipt of a module, the 
loader can either execute it in its own virtual address space or load it inside the 
operating system kernel. The decision depends on the kind of credentials the 
objects use to authenticate to the loader. In our current prototype we assume 
that code executing in the operating system kernel will not accidentally harm the 
system. There are however a number of techniques to shield against this type 
of errors, like software-based fault isolation ||2S] and the use of large address 
spaces m, which can be easily adopted in our active substrate. 
    KeyNote-Version: 2
    Authorizer: NET_MANAGER
    Licensees: TrafficAnalysis
    Conditions: (an_domain == "an_exec" && module == "capture" &&
                 (srcip == 158.130.6.0/24 ||
                  dstip == 158.130.6.0/24)   /* own network */
                 && snaplen == 40)           /* headers only */
                -> "ANONYMIZE";
    Signature: "rsa-md5-hex:f0015673"



Fig. 2. Example credential that grants “TrafficAnalysis” capture access to the network
interface for traffic to/from 158.130.6.0/24, packet headers only. The packet source and
destination addresses must be anonymized. 



Trusted Core. The applications that use our system require access to the pack- 
ets going through the node. On UNIX this can be accomplished using the Packet 
Capture library (pcap(3)) which provides wrapper functions for the Berkeley 
Packet Filter (bpf(4)). We extended the bpf_tap function inside the bpf(4)
device driver to process the packet and packet headers according to the
privileges of the application. For example, an application might be permitted access
to specific flows only, as shown in Figure 2, or be allowed to access packet
headers (instead of full packets) and only after the IP addresses are anonymized for
privacy (e.g. Figure 2). Traffic information can subsequently be passed on to the
applications.
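The kind of per-application processing described above (header-only capture plus address anonymization) might look roughly like the following C sketch. It is not the modified bpf_tap code: the structure `capture_priv`, the function names, and the anonymization function are all assumptions made for illustration. The address mapping is a keyed, invertible 32-bit mixing function chosen only because it keeps distinct addresses distinct; it is not cryptographically strong.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_BYTES 14   /* Ethernet header preceding the IPv4 header */
#define IP_HDR_MIN    20   /* IPv4 header without options */

/* Hypothetical privileges derived from a credential such as the one in Fig. 2. */
struct capture_priv {
    size_t   snaplen;      /* maximum bytes of the IP packet handed to the module */
    int      anonymize;    /* nonzero: rewrite source and destination addresses */
    uint32_t key;          /* per-node secret used by the address mapping */
};

/* Keyed, deterministic and invertible 32-bit mixing (illustrative only). */
static uint32_t anon_addr(uint32_t addr, uint32_t key)
{
    uint32_t x = addr ^ key;
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/* Copy at most snaplen bytes of the IP packet into out, anonymizing the
 * addresses if required. Returns the number of bytes written, or 0 if the
 * frame is too short or the addresses could not be safely anonymized. */
static size_t filter_packet(const struct capture_priv *p,
                            const uint8_t *frame, size_t frame_len,
                            uint8_t *out, size_t out_cap)
{
    size_t n;

    if (frame_len < ETH_HDR_BYTES + IP_HDR_MIN)
        return 0;
    n = frame_len - ETH_HDR_BYTES;
    if (n > p->snaplen) n = p->snaplen;
    if (n > out_cap)    n = out_cap;

    if (p->anonymize) {
        uint32_t a;
        if (n < IP_HDR_MIN)        /* never leak a partially copied address */
            return 0;
        memcpy(out, frame + ETH_HDR_BYTES, n);
        memcpy(&a, out + 12, 4); a = anon_addr(a, p->key); memcpy(out + 12, &a, 4);
        memcpy(&a, out + 16, 4); a = anon_addr(a, p->key); memcpy(out + 16, &a, 4);
        return n;
    }
    memcpy(out, frame + ETH_HDR_BYTES, n);
    return n;
}
```

Because the mapping is deterministic, flow-level analyses such as packet train detection still work on the anonymized headers. Note that rewriting the addresses invalidates the IP header checksum in the copy.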



Security Policy. We take a Trust Management approach to mobile code 
security. Trust Management is a novel approach to solving the generalized au- 
thorization and security policy problem. Entities in a trust-management system 
(called “principals”) are identified by public keys, and may generate signed pol- 
icy statements (which are similar in form to public-key certificates) that further 
delegate and refine the authorization they hold. This results in an inherently 
decentralized policy system: the system enforcing the policy need only consider 
the relevant policies and delegation credentials, which the user needs to provide. 

We have chosen to use KeyNote P| as our trust management system. KeyNote 
provides a simple notation for specifying both local policies and credentials. 
Applications communicate with a “KeyNote evaluator” that interprets KeyNote 
assertions and returns results to applications. A KeyNote evaluator accepts as 
input a set of local policy and credential assertions, and a set of attributes, called 
an “action environment,” that describes a proposed trusted action associated 
with a set of public keys (the requesting principals). The KeyNote evaluator 
determines whether proposed actions are consistent with local policy by applying 
the assertion predicates to the action environment. In our system, we use the 
action environment as the placeholder of component-specific information (such
as language constructs, resource bounds, etc.) and environment variables such
as time of day and node name, that are important to the policy management
function.

Fig. 3. The Cycle of a Packet Being Processed by the System and User Code.

We use KeyNote for performing policy compliance checks and settings when
loading the incoming object code. The KeyNote credential specifies what
resources will be allocated to the newly created process. Object modules that
execute inside the kernel have no resource bounds. However, user-level processes
are assigned specific resource permissions depending on the credentials they carry.
We rely on the rlimit structure of the UNIX operating system, where limits
for CPU time, memory size, number of allowed processes, etc., are specified.
We also forbid user processes from modifying those values. We did not attempt
to make the environment totally tamperproof as that would have been beyond 
the application-oriented scope of this work. However, there has been extensive 
research in this field which can be easily incorporated in our system mm- 
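As an illustration of the rlimit mechanism mentioned above, the following C sketch bounds CPU time and data segment size using the standard POSIX `getrlimit`/`setrlimit` calls. It is an assumption-laden example, not the loader's actual code: the `module_limits` structure and function names are hypothetical, and the mapping from a KeyNote credential to concrete limit values is not shown. Setting the hard limit equal to the soft limit means an unprivileged process cannot raise either value afterwards.

```c
#include <sys/resource.h>

/* Hypothetical per-module limits, as they might be derived from a credential. */
struct module_limits {
    rlim_t cpu_seconds;   /* maximum CPU time */
    rlim_t data_bytes;    /* maximum data segment size */
};

/* Set soft and hard limit to the same value, never above the current hard
 * limit (an unprivileged process may only lower its hard limit). */
static int set_fixed_limit(int resource, rlim_t value)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && value > rl.rlim_max)
        value = rl.rlim_max;
    rl.rlim_cur = value;
    rl.rlim_max = value;
    return setrlimit(resource, &rl);
}

/* Apply all limits; intended to run in the child process after fork()
 * and before control is handed to the user-level module. */
static int apply_module_limits(const struct module_limits *lim)
{
    if (set_fixed_limit(RLIMIT_CPU, lim->cpu_seconds) != 0)
        return -1;
    return set_fixed_limit(RLIMIT_DATA, lim->data_bytes);
}
```

A limit on the number of processes could be added in the same way on systems that provide `RLIMIT_NPROC`, which is a common BSD and Linux extension rather than part of POSIX.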

4 Experimental Study 

A number of experiments were performed with the implementation of our system 
on the test-bed shown in Figure 4. The experiments aim primarily to validate
our design and study system performance. 

Our test-bed consists of 7 x86-based routers and an “edge” machine, all 
running OpenBSD 2.8. Five of these machines are 1 GHz Pentium III with 256 MB
SDRAM, two are 400 MHz Pentium III with 256 MB of SDRAM (Kerkyra and
Ithaki) and one is a 166 MHz Pentium with 256 MB of SDRAM (Naxos). All
links in this topology are point-to-point. The core of the network (resembling 
a network backbone) is comprised of 1 Gbit/s Ethernet links (3Com 3C985-SX 
33MHz 64 bit PCI) and the surrounding Ethernet links are 100 Mbit/s. The edge 
machine is connected with a 10 Mbit/s Ethernet to one of the backbone routers. 
We used the ALTQ |H| implementation for providing the router functionality 
assumed by the traffic regulation and ECN-based accounting applications. 

4.1 System Demonstration 

Traffic Regulation. To demonstrate the traffic regulation module we designed
the following experiment. The attack progress and system response are illustrated
in Figure 5. Kythira (source) starts a TCP connection with our edge host,
Naxos (sink). We limit the flow to 10 Mbit/s using ALTQ on the source. At
times 5 sec, 15 sec and 25 sec, Kerkyra, Ithaki and Paxoi start a UDP flood
attack on Naxos. Since the IP source addresses are forged, the host under attack
(Naxos) must use a form of IP traceback to single out the offending hosts. Naxos
registers abnormal link utilization by employing a simple monitoring module
that measures overall load on the 10 Mbit/s link. Upstream router Cephalonia is 
requested to start sampling packets. A second monitoring module on Cephalonia 
samples the packets destined to Naxos and determines that offending packets 
arrive through Zakynthos and Lefkada. Cephalonia subsequently sends the result 
to Naxos. Naxos requests Zakynthos and Lefkada to start sampling and they 
return Kerkyra, Ithaki and Paxoi as sources of the offending packets. Once the 
offending sources are identified, Naxos asks Zakynthos and Lefkada to block 
traffic from Kerkyra, Ithaki and Paxoi. As the offending traffic is blocked, TCP 
traffic slowly recovers to its pre-attack levels. 
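The first step of this experiment, detecting abnormal utilization on the 10 Mbit/s link, can be approximated by a very small module. The C sketch below is hypothetical: the window length, the alarm threshold and all names are assumptions, since the text does not specify how the load monitor decides that utilization is abnormal.

```c
#include <stdint.h>

struct load_monitor {
    double   link_bps;       /* link capacity in bit/s, e.g. 10e6 */
    double   threshold;      /* assumed alarm level as a fraction of capacity */
    double   window_start;   /* start of the current window, in seconds */
    double   window_len;     /* window length, in seconds */
    uint64_t bytes;          /* bytes observed in the current window */
};

/* Account one packet observed at time now (seconds).
 * Returns 1 when the window that just closed exceeded the threshold. */
static int load_pkt(struct load_monitor *m, double now, uint32_t pkt_len)
{
    int alarm = 0;

    if (now - m->window_start >= m->window_len) {
        /* utilization of the window that has just ended */
        double util = (8.0 * (double)m->bytes) / (m->link_bps * m->window_len);
        alarm = util > m->threshold;
        m->window_start = now;
        m->bytes = 0;
    }
    m->bytes += pkt_len;
    return alarm;
}
```

On an alarm, the node would then ask its upstream router to start sampling, as in the sequence described above.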

Traffic Analysis. A simple traffic analysis experiment was performed with the
packet train analysis module. As our testbed does not carry real network traffic,
we generated traffic from host Lefkada to host Cephalonia, using a 3-hour-long
IP header trace captured at the University of Auckland². The module code, running
on Cephalonia, is shown in Appendix A and the results of the measurements in
Figure 6. Note that even a non-privileged user could be allowed to install and run
such a measurement module: no access to the packet payload is needed, and the
algorithm works with anonymized packets as well. This demonstrates how such an
approach can provide the technical basis for allowing researchers to use similar
modules on real network links.

² Traces and tools are available from
http://moat.nlanr.net/Traces/Kiwitraces/auck2.html

Fig. 5. Traffic Flood Escalation, Detection, and Filtering Events for Traffic Regulation
Experiment.



Accounting. The implementation of the ECN accounting module is remarkably 
simple (the module code is included in Appendix A). For each packet, the module 
checks whether the CE (Congestion Experienced) bit is set in the IP header, and 
calls the accounting function add_to with the appropriate table of marked or 
unmarked traffic as an argument. To test this module in our experimental set-up, 
we enabled ALTQ’s ECN on the 100 Mbit/s link between Ithaki and Zakynthos 
(see Figure 4). We generate traffic from the hosts behind Ithaki in the following
way: 

— A simulated Network A, where a mixture of TCP connections is generated 
based on ttcp, with a Poisson distribution for connection arrivals and an 
exponential distribution for the connection duration (in bytes). 

— A simulated Network B, with a mixture of TCP connections and short-lived,
bandwidth-savvy UDP streams, all using ttcp. UDP traffic is not
congestion-controlled, and this is clearly reflected in the resulting charges in
Table 1.







Fig. 6. Graphical Representation of Packet Train Distribution.

Table 1. Charges Matrix for ECN Experiment.

user        marked (charged)   unmarked (free)   total traffic   charge
Network A   104.29 MB          2.88 GB           2.99 GB         USD 52.14
Network B   283.84 MB          1.39 GB           1.67 GB         USD 141.92


4.2 Performance Study 

The questions that we attempt to answer with respect to performance are what
cost the applications have in terms of processing, and how much and what kind
of overhead the system structure imposes. The results are summarized in Table 2.
We send traffic from Zakynthos to Cephalonia and analyze performance at the
receiving monitoring system.

To measure the system overhead, we have implemented a simple module
that consumes a certain number of cycles for every packet received. We vary this
number of cycles burned and measure how many packets are dropped by bpf.
We calculate the maximum number of cycles available per packet on our system
as the maximum number of cycles before the drop rate starts to grow above 4%.
Figure 7 shows an experiment using traffic with a packet size of 1000 bytes
and a rate of 90 Mbit/s. In this experiment, performance starts to deteriorate
at around 55000 cycles.

Using 473-byte packets at 155 Mbit/s rate (about 40000 pkts/sec) the number
of cycles available is about 15000. Since Cephalonia is a 1.3 GHz P4 processor
system, the theoretical maximum number of cycles available per packet is around
30000. The large overhead is due to context switches, kernel-user space crossings,
memory copies and bpf-internal overhead that is unavoidable as we rely on a
composition of off-the-shelf components. The results indicate that a specialized
system structure could improve system performance by up to 100%, but also show
that a single, purely software-based system may not be sufficient to support
heavy workloads at higher network speeds. However, considering the application
workload (shown in Table 2) the system is still sufficiently fast to support a
fairly extensive application workload at 155 Mbit/s line rates.

Table 2. Cost Breakdown. Numbers in rows 1 and 2 refer to a 40 kpps traffic stream,
equivalent to a 155 Mbit/s link fully utilized by 473-byte packets.

Task                                        Cycles / packet
max. no. cycles (on 1.3 GHz P4)             31737
max. no. cycles (after system overheads)    15277 ± 1052
null module                                 1831 ± 89
anonymization                               131 ± 8
dump to disk                                3293 ± 416
ecn accounting                              850 ± 43
pkttrain                                    101 ± 0
ddos detection                              390 ± 1303
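The cycle budgets quoted in this section can be reproduced with simple arithmetic. The following C sketch is not part of the original system; the function names are hypothetical and link-level framing overhead is ignored, which is an assumption.

```c
/* Packets per second on a fully utilized link (framing overhead ignored). */
static double packets_per_second(double link_bps, unsigned pkt_bytes)
{
    return link_bps / (8.0 * pkt_bytes);
}

/* Upper bound on CPU cycles available per packet at a given clock rate. */
static double cycle_budget(double cpu_hz, double link_bps, unsigned pkt_bytes)
{
    return cpu_hz / packets_per_second(link_bps, pkt_bytes);
}

/* Fraction of a measured per-packet budget consumed by a set of modules. */
static double budget_used(const double *module_cycles, int n, double measured_budget)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += module_cycles[i];
    return sum / measured_budget;
}
```

With a 1.3 GHz clock and 473-byte packets at 155 Mbit/s, `cycle_budget` gives roughly 31737 cycles, matching the first row of Table 2. Taking the per-module figures of Table 2 at face value, running anonymization, ECN accounting, pkttrain and ddos detection together costs 1472 cycles per packet, which is under 10% of the measured budget of 15277 cycles.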



Fig. 7. BPF Drop Rate vs Number of Cycles Consumed by Each Packet for 1000-Byte
Packets Arriving at a Rate of 90 Mbit/s (x-axis: processing cycles per packet).
5 Related Work

Numerous system tools have been developed for network monitoring and
performing management functions. The first generation of measurement tools
such as tcpdump(8) were based on the Berkeley Packet Filter m- bpf (4) pro- 
vides operating system functionality for traffic monitoring. Users employ filters 
specified in a filter language that execute on a filter machine inside the kernel.
OC3MON is a dedicated host-based monitor for 155 Mbit/s OC3 ATM
links. The host records the captured headers into files for post-processing. No 
real-time functionality for packet processing is considered in the original design. 
ntop(8) |BI is an end-system network monitoring tool that shares a subset of 
the motives of our approach to extensibility. ntop(8) allows plugins, in forms 
of shared libraries, to be developed independently from the system core. The 
difference lies in that full trust in plugins is assumed and no further
security-enhancing function is built into the system. Also, there is no notion of using
the measurement data for router control and no provisions for distributed ap- 
plications across the node boundaries. Windmill HH is an extensible network 
probe environment which allows loading of “experiments” on the probe, with 
the purpose of analyzing protocol performance. The NIMI project j‘2 llj provides 
a platform for large-scale Internet measurement, with Java and Python based 
modules. NIMI only allows active measurements (i.e. handling of probe pack- 
ets, in contrast to getting real traffic from the wire). The security policy and 
resource control issues are addressed using standard ACL-like schemes. In the 
active networking arena, ABLE m provides an environment for network man- 
agement applications. It allows applications to monitor the network using SNMP. 
Executing the SNMP applications closer to the managed elements reduces the 
communication burden. However, ABLE is limited to what is exposed through 
the SNMP MIBs. 

6 Conclusions and Future Work 

We have shown that network monitoring offers fertile ground for active networking
solutions. This is mostly due to the continuously evolving and highly diverse 
nature of the monitoring applications. These characteristics impede designers 
and users from predicting and agreeing on a set of functions that can satisfy the 
needs of users in the long-term. If the quality of a design is judged by its ability to 
absorb the next wave of demand from its users, an active networking approach, as 
demonstrated in this paper, is a promising alternative to the existing designs. We 
also believe that such an application-centric, incremental methodology, can lead 
to experimental deployment of active networking technology while also advancing 
our understanding of the most crucial issues in the field. 

To validate our claims, we have built an active substrate, based on
off-the-shelf system components, and developed a set of practical applications. Our
study provides evidence for both the feasibility of providing and the utility of
exploiting a programmable monitoring infrastructure. We show that, from a
performance perspective, our implementation is able to support a reasonable
application workload at speeds of modern networks.

Developing practical network applications has given us insight on several 
aspects to explore further. For instance, our current implementation can control 
host-based OpenBSD routers for demonstration purposes. Extending our system 
to perform active router control functions involving routers is possible on the 
same architectural principles, although it may require different mechanisms for 
obtaining access to the actual network traffic. Furthermore, our performance 
study has indicated that a purely software-based approach can be much more 
efficient than the specific system structure used for our experiments. The design 
of a purpose-specific system can improve performance by up to 100% and becomes an
interesting topic for further work. 

Departing from a purely software approach, we also see tremendous potential 
in the use of network processors such as H3|. Our system can be improved by 
moving the filtering function as well as some of the processing of the modules 
to the network processor micro-engines. Adding or reserving processing capabil- 
ities on router interfaces for monitoring purposes and exposing them for pro- 
gramming can also be highly useful for certain limited workloads. A detailed 
study and characterization of potential application workloads may, in turn, re- 
veal diverging requirements on the design of network processors. We expect that 
monitoring workloads can be both processing and memory intensive which can 
be different from forwarding-type functions considered in the current network 
processor designs. 

Finally, apart from network processors, we would like to study other
distributed designs. Because the application workload will consist of
several independent modules, it can naturally be distributed among different
processors. The applicability of our approach can thereby be extended to higher
network speeds and more intensive workloads.
Appendix A - Module Code

Packet Train Module Code:

    /* train and pktstats are module-level state not shown in this excerpt */
    void foolet_pkt(u_char *myparams, void *pcaphdr, u_char *pkt)
    {
        /* skip the 14-byte Ethernet header */
        struct ip *iphdr = (struct ip *)(pkt + 14);
        static struct in_addr tr_src, tr_dst;

        if ((iphdr->ip_src.s_addr == tr_src.s_addr) &&
            (iphdr->ip_dst.s_addr == tr_dst.s_addr))
            train++;                /* same flow: the current train grows */
        else {
            /* flow changed: record the finished train, start a new one */
            pktstats[train] += train; train = 1;
            tr_src = iphdr->ip_src;
            tr_dst = iphdr->ip_dst;
        }
    }

ECN Accounting Module Code:

    void foolet_pkt(u_char *myparams, void *pcaphdr, u_char *pkt)
    {
        /* skip the 14-byte Ethernet header */
        struct ip *iphdr = (struct ip *)(pkt + 14);

        if (iphdr->ip_tos & IPTOS_CE)
            add_to(marked_tbl, iphdr);      /* CE set: charged traffic */
        else
            add_to(unmarked_tbl, iphdr);    /* CE not set: free traffic */
    }
Abstract. This paper describes an adaptive IP-unicast-based multicast
protocol that dynamically constructs a multicast tree based only on re- 
quest packets sent by clients. Since the protocol is simple but flexible and 
does not require any IP multicast addresses or special multicast mech- 
anisms, unlike the IP multicast protocol, it is scalable and suitable for 
personal stream broadcasting and related services. An application that 
dynamically attaches advertisements to multicasted streams is also pre- 
sented. The application attaches advertisements to the streams at active 
nodes instead of the server so it can deliver advertisements tailored to 
the individual recipient, according to his/her interests and/or location. 
An algorithm that minimizes the attachment cost over the corresponding 
multicast tree is developed. The multicast and advertisement attachment 
mechanisms are implemented using our own active network environment, 
and their validity is confirmed. 



1 Introduction 

The world wide web (WWW) lets users send content to unspecified recipients,
and streaming data such as MPEG2 video now makes up a larger percentage of the
content being sent. However, IP-unicast consumes too much bandwidth when
delivering streaming data.

IP-multicast is the most popular mechanism for multicasting to recipients 
located over a wide area. Many multicast routing protocols 0 d EH for IP-
multicast have been proposed and are being investigated for standardization at
IETF. However, in order to use IP-multicast, we need special IP addresses, called 
IP-multicast addresses, to specify the multicast groups. This makes it difficult 
for individuals to multicast streams since it is necessary to obtain a unique IP- 
multicast address whenever IP-multicasting is desired. Furthermore, all routers 
need to be able to route IP-multicast packets. 

In the source-specific multicast protocol, each multicast group is represented
by its server (source) address and a source-specific multicast address,
which has to be defined by the standard. Thus, the multicast group can be 
uniquely specified since the server address is unique. However, all routers still 
need to recognize IP-multicast addresses. 



I.W. Marshall, S. Nettles, and N. Wakamiya (Eds.): IWAN 2001, LNCS 2207, pp. 116-^^^ 2001. 
Recently proposed IP-unicast-based multicast protocols use active network
technology. The uniqueness of each multicast group is naturally
guaranteed since the group is specified by its server address. This solves the
problem encountered when individuals start to deliver streaming data. While the
branch points of multicast trees need to be active nodes, it is not necessary to
modify legacy nodes, i.e., add special mechanisms, when introducing active nodes
to existing networks, since the protocols use only IP-unicast packets. This makes
it easy to use the multicast mechanism in a network composed of both active and
legacy nodes. One of these protocols is so simple that it seems to be scalable.
However, it cannot dynamically move branch points when IP-unicast routing paths
are changed. Another constructs forward-path-based shortest path
trees for high-quality delivery of streaming data. In addition, it can
dynamically change branch points by using “ephemeral state probes”. However,
the server and clients need to exchange many packets to gather information on
network topology and to negotiate which branch point should be moved. While
some heuristics to reduce such packets were given, the scalability of the protocol
would be limited to some extent.

We introduce here a new IP-unicast-based multicast mechanism that dy- 
namically constructs a multicast tree by sharing common links among unicast 
(reverse) paths from clients to a server. The communication realized by the tree 
still appears to be unicast for each server-client pair. Thus servers and clients 
do not have to support a special mechanism for multicast. Our tree construc- 
tion is based only on a hierarchical keep-alive mechanism, i.e., request packets 
are periodically sent to the server. The mechanism dynamically reconstructs the 
tree so that it prevents particular active nodes from being overloaded and can 
optimize the tree against changes in IP-unicast routing paths. Furthermore, this 
dynamic reconstruction naturally supports the mobility of servers and clients. 
The keep-alive mechanism is so simple that it is very scalable. When adopt- 
ing the hierarchical keep-alive mechanism, one of the most important issues is 
how to determine time-out. We provide an algorithm that dynamically changes 
time-out depending on the tree topology in use. 

We introduce here an algorithm for attaching advertisements to multicas- 
ted streams at active nodes that minimizes the total attachment cost over the 
corresponding multicast tree. By using the algorithm, each recipient can receive 
streams with suitable personalized advertisements. From a commercial point of 
view, this algorithm makes our multicast mechanism more attractive. An anal- 
ogy is drawn to the advertisement banners on private WWW pages. A recent 
trend is to customize advertisements to suit each viewer since this amplifies their 
effectiveness. However, it is difficult to customize advertisements at the server 
(root) of a multicast tree, since the server has to manage the advertisements 
sent to all clients. Advertising normally uses a small data set repeatedly. Thus, 
a reasonable approach is to dynamically load the advertisement data into active 
nodes and attach the advertisements that meet the preference of each recipi- 
ent at active nodes. In fact, active networks have been shown to be a desirable 
approach to adding extra functions to multicast mechanisms.
We implemented our multicast mechanism with the advertisement attachment
function within our active network environment.

Section 2 describes our multicast mechanism while Section 3 defines two cost
models and gives an algorithm for each of them. Section 4 shows our
implementation; Section 5 provides a brief conclusion.

2 Multicast Using IP-Unicast Addresses 

We present the protocol of our multicast mechanism and then describe its char- 
acteristics of load balancing and dynamic reconstruction. Next we consider how 
to implement the keep-alive mechanism used in the protocol. 

2.1 Protocol 

Clients who want to receive multicast data send a “join” packet containing the 
pair of the server address and port as its destination. The legacy nodes on the 
path between active nodes simply forward multicast packets according to their 
routing tables since the packets are just IP-unicast packets. When a join packet 
arrives at an active node on the path between the source and the destination 
(server), the source address written in the packet is registered with the multicast 
routing table at the node if the table exists; otherwise the table for the server 
is created before registration, and the node sends the server the join packet 
with its address as the source in order to join the multicast tree of the server. 
This joining operation propagates from the client through intermediate nodes 
until the join packet reaches the server or a node that already has the table. 
The tables determine to which client the multicast data from the server is to 
be delivered. From the above, it is clear that delivery follows the reverse of the 
unicast path from the client to the server. The reverse-paths between clients 
and the server are bundled as much as possible by the active nodes on the 
paths. Clearly, our multicast protocol can work even if the reverse-paths are 
different from the unicast paths from the server to clients, say, the forward- 
paths. While, in general, forward-paths give higher stream delivery quality than 
reverse-paths, forward-path-based tree construction often results in complicated 
or non-adaptable protocols. Our main goal is a scalable and adaptable protocol, 
which lets us adopt reverse-path based tree construction. 
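
The join handling described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in this paper: the class `ActiveNode`, the method `on_join`, the helper `join`, and the use of plain strings as addresses are all hypothetical, and legacy nodes are simply omitted because they forward join packets unchanged.

```python
class ActiveNode:
    """Hypothetical active node with one multicast routing table per server."""

    def __init__(self, address):
        self.address = address
        self.tables = {}  # (server address, port) -> set of child addresses

    def on_join(self, source, server):
        """Process a join packet; return the join to send upstream, or None."""
        table_existed = server in self.tables
        # Create the table for this server on first use, then register the sender.
        self.tables.setdefault(server, set()).add(source)
        if table_existed:
            return None  # already part of the tree: the join stops here
        # First join for this server: join the tree under this node's own address.
        return (self.address, server)


def join(client, path, server):
    """Propagate a join along the active nodes on the client-to-server path."""
    source = client
    for node in path:
        packet = node.on_join(source, server)
        if packet is None:
            return node.address  # absorbed by a node that already had the table
        source = packet[0]
    return server[0]  # the join reached the server itself
```

With two active nodes A and B on the path, the first client's join reaches the server and leaves the client registered at A and A registered at B; a second client on the same path is absorbed at A. The keep-alive refresh described next is deliberately left out of this sketch.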

Another simple but key feature of our multicast tree construction is its keep- 
alive mechanism. Each client continually sends join packets while it wants to 
receive multicast data. The parent¹ of the client sends multicast data if a join
packet from the client arrives within some interval. In other words, a client that 
stops sending join packets expires and no multicast data is delivered to the 
client. The parent node also continues to send join packets to the server of the 

¹ Following the usual terminology of a tree, for each node (including the server and
clients) of a multicast tree, we refer to the server-side and client-side neighbor(s) 
as, respectively, the parent and the children of the node. Other related terms like 
ancestors or descendants will also be used without definitions. 
Fig. 1. Client 7 joins when node 9 cannot accommodate any more children.



tree, while it has at least one child that remains active. The same keep-alive 
mechanism works between the node and its parent. In this way, the keep-alive 
mechanism starts from clients and propagates through each parent-child pair.

2.2 Load Balancing 

Consider the tree in Figure 1. The topmost node is a server and the tree with the
server as its root has 3 active nodes (numbered 8 to 10) and 6 clients (numbered 
1 to 6). Clients 1 to 6 are already members of the tree. Client 7 now sends a join 
packet to the server. Here we assume that nodes 9 and 10 can accommodate at 
most three and four children, respectively. Node 9 has a multicast routing table 
since it is already a node of the tree. However, the join packet from client 7 is 
forwarded by node 9 toward the server only if the link to node 10 has sufficient 
bandwidth, since node 9 already has three children. The join packet arrives at
node 10 and client 7 is registered in node 10’s table. The resulting tree is shown in
Figure 2. This is a natural extension of the basic protocol and it accommodates
as many clients as possible without over-burdening active nodes. 

2.3 Dynamic Reconstruction of Multicast Trees 

In Figure 2, assume that client 5 stops sending join packets. This causes client
5 to expire at node 9. Node 9 can now accommodate one more child. Thus,
when the first join packet from client 7 arrives at node 9 after the expiration of
client 5, client 7 is registered in node 9’s table as shown in Figure 3. Note that
client 7 continually sends join packets to stay alive. The join packet from client 
7 is no longer forwarded by node 9 toward the server. This leads to the expira- 
tion of client 7 at node 10. In this way, the keep-alive mechanism dynamically 
reconstructs the tree so as to minimize traffic. 
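
The interplay between the child limit of Section 2.2 and expiration can be illustrated with a small soft-state sketch. It is a simplification under assumptions: `SoftStateNode`, its fields, and the single fixed `timeout` are hypothetical names chosen for the example, and the bandwidth check on the upstream link is omitted.

```python
class SoftStateNode:
    """Hypothetical active node whose children expire unless refreshed."""

    def __init__(self, name, capacity, timeout):
        self.name = name
        self.capacity = capacity  # maximum number of children
        self.timeout = timeout    # expiration interval
        self.children = {}        # child -> arrival time of its last join packet

    def expire(self, now):
        """Drop children whose last join packet is older than the timeout."""
        self.children = {c: t for c, t in self.children.items()
                         if now - t <= self.timeout}

    def on_join(self, child, now):
        """Return True if the child is registered here, False if the join passes on."""
        self.expire(now)
        if child in self.children or len(self.children) < self.capacity:
            self.children[child] = now
            return True
        return False  # full: the join is forwarded unchanged toward the server
```

A full node lets the join pass to the next active node; once one of its children expires, the next join from the same client is registered locally and no longer travels upstream, so the stale entry at the upstream node times out on its own.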



Fig. 2. Tree after client 7 joins. 





Fig. 3. Tree after client 5 stops sending join packets and expires. 



This dynamic reconstruction is effective especially when clients are mobile 
hosts. In Figure 4, client 3 moves and its neighbor changes from node 10 to node
8. Client 3 is then registered at node 8 and expires at node 10. 

Another reconstruction is triggered by the movement of the server as is pos- 
sible when the server is a mobile host such as a palm-top computer with a small 
video camera. If the network has a platform to support IP-unicast to mobile 
hosts, say, the mobile-IP framework, the keep-alive mechanism reconstructs
the tree as follows. As in Figure 5, the server moves and its neighboring active
node changes from node 10 to 11. The join packets from clients are routed for the 
new location of the server by the mobile-IP framework, since the join packet is 
Fig. 4. Tree after client 3 moves.



an IP-unicast packet with the server address as its destination. Here we assume 
that node 11 comes to lie on the path from nodes 3, 8 and 9 to the server, and 
that node 10 is not on the path. As a result, nodes 3, 8 and 9 are registered 
at node 11, and node 11 sends a join packet to the server. Node 10 may also 
be registered at node 11 since it may send a join packet to the server, when we 
assume that node 11 is on the path from node 10 to the server. At node 10, 
nodes 3, 8 and 9 expire after a moment since the join packets from nodes 3, 8 
and 9 do not reach 10 any more. Node 10 then stops sending join packets to the 
server and expires at node 11. 

During tree reconstruction, multiple packets carrying the same data may 
arrive until the old parent-child relationships disappear. In Figure 3, nodes 9
and 10 may send redundant packets to client 6. In Figure 5, nodes 10 and 11
may send redundant packets to nodes 3, 8 and 9. However, these packets can
be discarded by checking the time-stamp and sequence number assigned at the
server following, say, RTP (Real-Time Transport Protocol).



2.4 Expiration Time 

The keep-alive mechanism must not allow a parent to expire before all of its 
children. A simple way of accomplishing this is to use a timer at each node: 
each node periodically sends a join packet to the server if there is at least one 
child that is active. However, triggering packet transmission with a timer may load
resources so heavily that it delays other processes such as data packet delivery.

Another way of implementing the keep-alive mechanism is for each node to 
send a join packet (if needed) only when the node receives a join packet. In other
words, the sending operation is performed as the result of evaluating an active 
code in a join packet. However, if each active node sends a join packet every 
time it receives a join packet, the server may receive more join packets than it 
can process. 

Fig. 5. Tree reconstruction when the server moves. 



We now consider when a node should send a join packet and how to set the 
expiration time interval. Assume that all clients send a join packet periodically 
at interval D. Each node has a variable T, which is the time when the last join 
packet was sent. When the node receives a join packet, it compares the current
time $T_{now}$ and T. If $T_{now} - T > D$, it sends a join packet.
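
The rule for when to send a join packet upstream amounts to a simple rate limiter. The sketch below is illustrative only: the class name `JoinThrottle` and the choice to initialise T to minus infinity (so that the very first join received is always passed on) are assumptions, not details given in the text.

```python
class JoinThrottle:
    """Hypothetical per-node state deciding when a join is sent to the parent."""

    def __init__(self, interval):
        self.interval = interval        # D: the period at which clients send joins
        self.last_sent = float("-inf")  # T: time the last join packet was sent

    def on_join_received(self, now):
        """Return True if a join packet should be sent upstream now."""
        if now - self.last_sent > self.interval:
            self.last_sent = now
            return True
        return False
```

Because the decision is taken only when a join packet arrives, no timer is needed at the node, which is the point of this variant of the keep-alive mechanism.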

Before analyzing the above algorithm, we define the height of a node as the
number of links of the longest unicast path from the clients among its
descendants to the node. Denote the maximum interval between join packets
being sent and received at a node with height h by $D_s(h)$ and $D_r(h)$,
respectively.

If we assume that there is no link and processing delay jitter, i.e., $D_r(h) =
D_s(h-1)$, it can be proved by induction that

$$D_r(h) \le h \times D.$$

Assume that a node with height $h-1$ receives a join packet from a child just
before time D has passed since the last time a join packet was sent, and the node
does not receive join packets until receiving the next join packet from the child.
This case maximizes $D_s(h-1)$. Thus, we have $D_r(h) = D_s(h-1) \le D + D_r(h-1)$.
Clearly, $D_r(1) = D$.

If we cannot ignore jitter, we can evaluate $D_r(h)$ by modifying the above
slightly. Let $\epsilon_1$ and $\epsilon_2$ be the maximum increase of the link delay and the
processing delay, respectively. Then $D_r(h) \le D_s(h-1) + \epsilon_1$ and
$D_s(h) \le D + D_r(h) + \epsilon_2$. Thus, we have

$$D_r(h) \le h \times (D + \epsilon_1 + \epsilon_2)$$

since $D_r(1) \le D + \epsilon_1 + \epsilon_2$.
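
The induction behind the bound with jitter can be unrolled explicitly. The chain below only combines the two inequalities stated above, writing $\epsilon_1$ and $\epsilon_2$ for the maximum increases of the link delay and the processing delay; it introduces no new assumptions.

```latex
\begin{align*}
D_r(h) &\le D_s(h-1) + \epsilon_1 \\
       &\le D + D_r(h-1) + \epsilon_1 + \epsilon_2 \\
       &\le (h-1)(D + \epsilon_1 + \epsilon_2) + D_r(1) \\
       &\le h \times (D + \epsilon_1 + \epsilon_2).
\end{align*}
```

The second line applies the bound on $D_s$ at height $h-1$, the third repeats the first two steps $h-1$ times, and the last uses $D_r(1) \le D + \epsilon_1 + \epsilon_2$.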

In a reliable network in which packets are seldom lost, a straightforward 
way of setting the expiration time interval is to set $D_r(h^*)$ as the interval at all
nodes, where $h^*$ is the height of the server. However, $D_r(h^*)$ is too long for nodes
whose height is less than $h^*$. This may result in consuming bandwidth wastefully,
since an active node does not stop sending the streaming data to its children
until they expire. A solution to this problem is for each node to dynamically set
$D_r(h)$ as the interval for a child with height $h-1$. The node knows the height
of its children in a dynamic-programming way by using join packets as follows. 
Clients write height 0 in the join packets they send. Active nodes choose the 
maximum value of the heights written in join packets from their children and 
write the value plus 1 in the join packet they send. Thus, an active node can 
know the height of its children from the join packets they send. This dynamic 
way of setting the expiration time interval ensures that the keep-alive-based tree 
construction remains scalable. 
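
The height bookkeeping and the resulting per-child time-out can be sketched as follows. The function names and the optional jitter parameters are assumptions made for this example; the time-out uses the bound on D_r(h) derived above.

```python
def height_to_advertise(child_heights):
    """Height a node writes in its own join packet.

    Clients have no children, pass an empty list, and so advertise height 0.
    """
    return max(child_heights) + 1 if child_heights else 0


def expiration_interval(child_height, interval, link_jitter=0.0, proc_jitter=0.0):
    """Time-out for a child of the given height: the bound on D_r(child_height + 1)."""
    h = child_height + 1
    return h * (interval + link_jitter + proc_jitter)
```

A client (height 0) is therefore timed out after roughly one period D, while a child that is itself the root of a deeper subtree is given proportionally more slack.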

The above procedure can be applied to computing other information such as 
the number of recipients. The preference vector described in the next section is 
one such type of information. 



3 Attaching Advertisements Customized for Each Client 

This section describes an algorithm that attaches advertisements that suit the 
preference of each client. 

Each client (user) chooses its favorite categories from n categories of
advertisements (e.g., sports, cooking, and travel)². Without loss of generality, we can
assume that all intermediate nodes can be branch points, i.e., active nodes, since 
we can think of a unicast path between neighboring active nodes as being a link 
even if there are legacy nodes between the active nodes. 

Each active node on the path between the server and each client can attach 
an advertisement to the stream coming from the parent if no advertisement is 
attached to the stream; otherwise it can replace/delete the advertisement. This 
attachment/replacement/deletion operation can be done for each child indepen- 
dent of the other children. 

For a network composed of such active nodes, we propose algorithms that 
attach an advertisement consistent with the preference of each client and mini- 
mize the total attachment cost over the tree under the two cost models defined 
below. 



3.1 Binary Cost Model 

We describe an algorithm based on the following cost model. 

Definition 1 (Binary Cost Model). It costs an active node 0 to forward the
stream coming from its parent to a child, and 1 to attach an advertisement to the
stream or replace/delete the advertisement attached by one of its ancestors (and

² Each user does not need to explicitly choose the categories. Such choice can be done
automatically by software that extracts the preference of users from their access 
history. 
then send the resulting stream to a child). The active node is charged the cost
for each child. In other words, the cost of the active node is the sum of the at- 
tachment/replacement/deletion costs for all its children. The cost of a multicast 
tree is the sum of the costs of all active nodes in the tree. 

If an active node with m children forwards the stream to x of the children and
attaches/replaces/deletes advertisements for the $m - x$ remaining children, the cost of
the node is $m - x$. Therefore, minimizing the cost of a multicast tree is equivalent
to minimizing the number of advertisement attachment/replacement/deletion 
operations that occur in the multicast tree. 
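
As a concrete reading of the binary cost model, the cost of a node is the number of children that receive a different ad-category from the one arriving on the incoming stream. The helper names `node_cost` and `tree_cost`, and the representation of a tree as a list of (incoming, outgoing) pairs, are assumptions made for this sketch.

```python
def node_cost(incoming, outgoing):
    """Binary-model cost of one active node.

    incoming: ad-category carried by the stream arriving from the parent.
    outgoing: list with the ad-category sent to each child.
    """
    # Forwarding (same category) costs 0; any attach/replace/delete costs 1.
    return sum(1 for category in outgoing if category != incoming)


def tree_cost(nodes):
    """Cost of a tree given one (incoming, outgoing) pair per active node."""
    return sum(node_cost(incoming, outgoing) for incoming, outgoing in nodes)
```

For a node with m = 3 children that forwards to x = 1 of them, `node_cost(1, [1, 2, 2])` evaluates to m - x = 2.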



Algorithm. Our algorithm has two phases: request and delivery. In the request
phase, information on the clients’ favorite advertisement categories (or
ad-categories) is propagated to the active nodes on the path to the server. Based on
the information, each active node decides which operations should be performed
and which ad-category should be used. This decision is made so that each
client receives streams carrying an advertisement of its favorite ad-category
and the cost of the multicast tree is minimized. If two or more advertisements
belong to the same ad-category, one of them is selected arbitrarily. In
the following, we will use ‘advertisement’ and ‘ad-category’ interchangeably. The
delivery phase performs the operation decided at each node.

The details of the request phase are as follows. 

Each client sends information on its favorite ad-categories expressed by a 
preference vector (defined below) to its parent. Each active node computes its 
favorite ad-categories from the preference vectors sent by its children, and sends 
the preference vector expressing the ad-categories to its parent. 

Definition 2 (Preference Vector). For the number n of ad-categories, a
preference vector is a vector of n binary-valued elements $(f_1, f_2, \ldots, f_n)$
($f_i \in \{0, 1\}$, $i = 1, 2, \ldots, n$) such that $f_i = 1$ if the ith ad-category
is desired, and $f_i = 0$ otherwise.

If no advertisement is attached, we still regard it as one ad-category, say, the first
ad-category³. Thus, there are two choices at a node, replacement and forwarding.

Each active node computes a new preference vector that will be sent to its 
parent by (1) summing up the preference vectors of all children, (2) choosing one 
or more elements having maximum values, and (3) setting 1 on the summation’s 
maximal elements of the new vector and 0 on the other elements. Note that 
multiple elements may be 1 if there are two or more maximal elements. 

For example, consider a node with two children as shown in Figure 6. The two
preference vectors from the children are (0, 1, 0) and (0, 1, 1). Preference vector 
(0,1,0) is sent to the parent of the node, since the sum of the two vectors is 
(0,2,1). 
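
The three-step aggregation rule can be written down directly. The function name `aggregate_preferences` is hypothetical, and the sketch assumes the node has at least one child and that all vectors have the same length n.

```python
def aggregate_preferences(child_vectors):
    """Combine the children's preference vectors into the vector sent upstream."""
    # (1) Sum the vectors element-wise.
    totals = [sum(column) for column in zip(*child_vectors)]
    # (2) Find the maximum value among the sums.
    best = max(totals)
    # (3) Set 1 on every maximal element and 0 elsewhere.
    return tuple(1 if total == best else 0 for total in totals)
```

For the two children in the example, the sums are (0, 2, 1) and the result is (0, 1, 0); when two categories tie, both elements are set to 1.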

³ Note that the first element is always 0 if clients are prohibited from requesting the
first ad-category (this is a necessary condition from a commercial point of view). 
Fig. 6. Preference Vectors into and out of an Active Node.



The node also creates or updates a table used in the delivery phase from the 
received preference vectors. This table indicates which advertisements the node 
should replace. This table is called the delivery table. For each child sending a 
preference vector F, if the incoming stream has an advertisement for an element
with value 1 in F, then the forward operation is performed; otherwise the
advertisement is replaced with one of the advertisements for those elements (and
the resulting stream is sent to the child).

It is convenient to express the table as an $n \times m$ matrix. Element (i, j) of the
table expresses the advertisement with which the advertisement of the incoming
streaming data is replaced in order to send the resulting stream to the jth child,
when the advertisement of the incoming stream belongs to the ith ad-category.
Hence, the table created at the node in Figure 6 is

$$\begin{pmatrix} 2 & 2 \text{ or } 3 \\ 2 & 2 \\ 2 & 3 \end{pmatrix}$$




where its children are numbered from left to right⁴. For example, if the streaming
data with the first advertisement enters the node in Figure 6, the first row of the
above table indicates that the advertisement should be replaced with the second 
one for sending the resulting data to the first child, and it should be replaced 
with the second or third one for the second child. 
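
The construction of the delivery table from the children's preference vectors can be sketched as below. The function name `delivery_table` and the choice to store each entry as a set of admissible ad-categories are assumptions made for illustration; an entry equal to the incoming category means the stream is forwarded unchanged.

```python
def delivery_table(child_vectors):
    """Build the n x m delivery table for the binary cost model.

    Entry [i][j] is the set of ad-categories (numbered from 1) that may be
    sent to the j-th child when the incoming stream carries category i + 1.
    """
    n = len(child_vectors[0])
    table = []
    for incoming in range(1, n + 1):
        row = []
        for vector in child_vectors:
            if vector[incoming - 1] == 1:
                row.append({incoming})  # forward unchanged (cost 0)
            else:
                # Replace with any category the child accepts (cost 1).
                row.append({c for c in range(1, n + 1) if vector[c - 1] == 1})
        table.append(row)
    return table
```

For children sending (0, 1, 0) and (0, 1, 1), the first row is [{2}, {2, 3}], which matches the description above: replace with the second advertisement for the first child, and with the second or third for the second child.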

We describe the behavior of the request phase on a multicast tree using
Figure 7. Assume that there are three ad-categories. Here, as mentioned before,
if no advertisement is attached it is regarded as the first ad-category. In Figure 7,
clients 5 and 7 are interested in ad-category 2, and client 6 in ad-category 3. By
the definition of preference vectors, a client who is interested in the second ad- 
category sends preference vector (0,1,0) to its parent, and a client who likes 
the third ad-category sends (0, 0, 1). Active node 3 sends (0, 1, 0) to active node 

⁴ In the rest of this paper, we adopt the same rule for numbering children.
Fig. 7. Request Phase of the Binary Cost Model.



2, active node 4 sends (0, 1, 1) to active node 2, active node 2 sends (0, 1,0) to 
active node 1, and active node 1 sends (0,1,0) to the server. At this time, the 
tables created at active nodes 1, 2, 3 and 4 are, respectively. 






The delivery phase begins when preference vectors arrive at the server.

When the server does not have the advertisement function, it regards the 
preference vectors as just requests for the streaming data it has, and sends the 
streaming data without any advertisement to its children, i.e., the sources of the 
requests. When the server can attach advertisements, it sends to each child the 
streaming data with the advertisement for one of the preference vector elements 
with value 1. 

The streaming data is multicasted with advertisement replacement operations
being performed at each node according to the delivery table. Figure 8
shows advertisement replacement. In Figure 8, the number written in each delivery
packet expresses the category of the advertisement carried by the packet;
the delivery packets are those that carry streaming data with advertisements 
in the delivery phase. This example assumes that the server does not have the 
advertisement attachment function, i.e., the server always sends the streaming 
data with the first ad-category. 

The above discussion does not consider the difference in the arrival times 
of the preference vectors. However, it is unlikely that all preference vectors will 
arrive at the same time. Another problem is that clients may join or leave the 
multicast tree at any time. This may change the preference vector to be sent. Yet 



Fig. 8. Delivery Phase of the Binary Cost Model. 



another problem is that some clients in the tree may change their favorite ad- 
categories while receiving the stream. To solve these problems, each node records 
the preference vector of each child, computes a new preference vector from the 
recorded vectors, and sends the computed vector to its parent at the time some 
event occurs, i.e., when a new client joins the tree as a child of the node, when 
some child leaves the tree, or when the preference vector of a child changes. 
Note that clients need to send preference vectors when changing their favorite 
ad-categories as well as when joining the tree. It depends on implementation as to 
when to check whether the above events have occurred or not. A straightforward 
way is to check periodically by using timers built into the active nodes, as in the case
of the keep-alive mechanism in Section 2.4. However, using timers often delays
other processes. In our implementation, checking is triggered by the arrival of 
preference vectors. 

In the transitional intervals during which delivery tables are being updated, 
the advertisement of the incoming streaming data may ‘contradict’ the preference 
vector that was sent last. However, clients can receive the streaming data with 
their favorite advertisement provided the table at their parents is updated. 

The following theorem guarantees the optimality of our algorithm over stable 
(i.e. non-transitional) intervals. The theorem can be proved by using inductively 
the following fact: each active node replaces advertisements in order to mini- 
mize the number of advertisement replacements and so minimize the cost of the 
subtree with the active node as its root. 

Theorem 1. For a given multicast tree, our algorithm minimizes the cost of the 
tree under the binary cost model over stable intervals. 
3.2 General Cost Model

In this section, we define the general cost model and generalize an algorithm for 
minimizing the advertisement attachment cost over a multicast tree under the 
cost model. 

Definition 3 (General Cost Model). For each active node k, a cost function
$g_k(i, a_1, \ldots, a_m)$ for advertisement replacement exists, where m is the number of
children of k, i is the category of the advertisement in the incoming streaming
data, and $a_j$ ($j = 1, 2, \ldots, m$) is the category of the advertisement with which the
advertisement of the incoming data is replaced before being sent to the jth child.
The cost of active node k is the value computed by $g_k(i, a_1, \ldots, a_m)$, and the cost
of a multicast tree is the sum of the costs of all active nodes in the tree.

In the above definition, we assume that advertisements of the same category can 
be regarded as equivalent to each other in terms of the cost function. If not, we 
divide a category into sub-categories so that we can continue to use the above 
assumption. We use advertisement and ad-category interchangeably as in the 
previous section. 



Algorithm. Our algorithm has the request and delivery phases. In the request 
phase, each node (and client) k sends to its parent P{k) a cost vector defined 
below and a delivery table is created at each node. The table has the same 
meaning as in the binary cost model. The delivery phase acts in the same way 
as in the binary cost model. In the following, we focus on the request phase. 

Definition 4 (Cost Vector). For the number n of ad-categories, a cost vector
that node k sends to its parent is an n-dimensional vector
$\mathbf{v}^k = (v_1^k, v_2^k, \ldots, v_n^k)$,
where $v_i^k$ is the minimum cost of the subtree T(k) with root k that is attainable
when k receives the streaming data with the ith ad-category. The cost vector
sent by each client has value 0 at the elements for its favorite ad-categories and
infinity, denoted by ‘inf’, at the other elements.
For example, a cost vector sent by a client who wants to receive the fourth
ad-category is (inf, inf, inf, 0, inf, ..., inf). The first ad-category means that no ad- 
vertisement is attached as in the previous section. Hence, the vectors from clients 
have ‘inf’ at the first element. 

Each node k computes $v_i^k$ ($i = 1, 2, \ldots, n$) from the cost vectors sent by its
children and cost function $g_k(i, a_1, \ldots, a_m)$ as follows.

If $c(k,j)$ is the jth child of k, the cost vector from the jth child of k
is $\mathbf{v}^{c(k,j)} = (v_1^{c(k,j)}, \ldots, v_n^{c(k,j)})$. Hence, $v_{a_j}^{c(k,j)}$ is the minimum cost of
$T(c(k,j))$ when k sends $c(k,j)$ the streaming data with the $a_j$th ad-category.
Therefore, the minimum cost of T(k) when k receives the streaming data with
the ith ad-category and, for each j, sends the jth child the data with the $a_j$th
ad-category, is given by
$g_k(i, a_1, \ldots, a_m) + v_{a_1}^{c(k,1)} + \cdots + v_{a_m}^{c(k,m)}$. This leads to:

$$v_i^k = \min_{a_1, \ldots, a_m} \left\{ g_k(i, a_1, \ldots, a_m) + \sum_{j=1}^{m} v_{a_j}^{c(k,j)} \right\}$$

By computing this for each i, we obtain $\mathbf{v}^k$.

Delivery tables are created/updated during the computation of $\mathbf{v}^k$ such
that the ith row of the table is the $(a_1, a_2, \ldots, a_m)$ which gives $v_i^k$, for each i
($i = 1, 2, \ldots, n$). By referring to the table, the delivery phase replaces advertisements
at each node k so that the cost of T(k) is minimized. This operation is performed
recursively from the root (server) to the leaves (clients). Thus we have
the following theorem.

Theorem 2. For a given multicast tree, our algorithm minimizes the cost of the 
tree under the general cost model over stable intervals. 
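
The request-phase computation under the general cost model can be sketched as a bottom-up dynamic program. This is an illustration under assumptions: the function names are hypothetical, the cost function is passed as a Python callable `g(i, choice)`, and the minimization is done by exhaustive search over all n^m assignments, which is only practical for small n and m.

```python
from itertools import product

INF = float("inf")


def client_cost_vector(n, favorites):
    """Cost vector of a client: 0 for favorite ad-categories (from 1), inf otherwise."""
    return [0 if i in favorites else INF for i in range(1, n + 1)]


def node_cost_vector(g, child_vectors):
    """Return (cost vector, delivery table) of a node.

    g(i, choice) is the node's cost function, where choice is a tuple holding
    the ad-category (numbered from 1) sent to each child.
    """
    n = len(child_vectors[0])
    m = len(child_vectors)
    costs, table = [], []
    for i in range(1, n + 1):
        best_cost, best_choice = INF, None
        for choice in product(range(1, n + 1), repeat=m):
            # Node cost plus the minimum subtree cost of each child.
            cost = g(i, choice) + sum(
                child_vectors[j][a - 1] for j, a in enumerate(choice))
            if cost < best_cost:
                best_cost, best_choice = cost, choice
        costs.append(best_cost)
        table.append(best_choice)  # row i of the delivery table
    return costs, table
```

Passing `lambda i, choice: sum(1 for a in choice if a != i)` as `g` recovers the binary cost model. When a cost function decomposes into independent per-child terms, the minimum could be taken child by child instead of over all assignments, but the definition above does not require that.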

If, for an over-loaded node k, we define the cost function of k so that the function 
gives relatively large values for replacement operations, the forwarding opera- 
tion is likely at node k. In this way, the load-balancing of active nodes can be 
accomplished by dynamically updating the cost functions according to the load 
of node k. The algorithm can also control the traffic through each link. If the link 
from P{k) to node k is crowded, the algorithm can reduce the traffic of the link 
by deleting the advertisement at P{k). This is accomplished by updating the 
cost function of k so that it gives relatively large values when the advertisement 
in the incoming stream does NOT belong to the first ad-category. 



3.3 Applications 

It is simple to modify the above algorithms so that information from the server 
can be used to force the selection of one of multiple advertisements in the same 
ad-category. Examples of the information from the server include the number of 
recipients or the kind of movie. By inserting the information into each delivery 
packet at the server, active nodes can be directed to choose the most appropriate 
advertisement. 
Table 1. An Example of a Vector Table.

child address    preference vector    arrival time
10.27.124.5      01010010             20:19:18
10.27.124.7      01001000             20:19:15
10.27.124.1      00100000             20:19:20



Table 2. Functions and Source/Destination Addresses of Request/Delivery Packets.

Packet name      Source addr.            Destination addr.       Function
Request packet   Client or active node   Server                  Join the multicast tree, keep alive, and carry a preference vector
Delivery packet  Server                  Client or active node   Carry the streaming data



In addition, the proposed algorithms have other applications. Transcoding 
(transforming encoding formats) at active nodes is one such application. The 
preference vector can be utilized to carry the clients’ desired format back up the 
tree. The same mechanism used to minimize the cost of replacement operations 
will also minimize the cost of transcoding operations. 

4 Implementation 

To implement our algorithm, we used our active network environment which is 
based on a 100BASE-T Ethernet. The architecture of the environment is based
on the “active packets approach”: the active code contained in an active packet
can read from and write to the memory of each node. Memory contents are
kept after the evaluation of a packet is completed. 

In our commercial multicasting implementation, a server sends out an 
MPEG2 data stream and clients receive the stream with their favorite adver- 
tisements. We used characters as the advertisement data in the first trial, and 
the implemented advertisement attaching algorithm is based on the binary cost 
model. We used two kinds of active packets: request packets and delivery packets,
as shown in Table 2.

The active packet in the request phase includes a preference vector, and the 
delivery packet includes the MPEG2 streaming data and the advertisement data. 
We call the packets carrying a preference vector the request packets. They also
have the join-packet function for multicast-tree construction. In other words, 
request packets are sent at interval D by the clients and are used to dynami- 
cally construct multicast trees (i.e., create/update the multicast routing table) 
as shown in Section 0 as well as to create/update the delivery tables. Thus
the source and destination of the request packet is the client and the server,
respectively.

Fig. 10. A Snapshot of the Client Interface.

When request packets arrive at an active node, the preference vectors
contained in the packets and their arrival times are recorded for each client in a
table, called a vector table. Table 1 is an example of a vector table. The
delivery table is then created or updated. Next, a new preference vector is computed
from the preference vectors recorded in the vector table according to the method
described earlier. Finally, the new vector is sent to the server (by using a request
packet) if it is different from the last vector sent or if time D has passed since the
last time a request packet was sent. At this time, the source and destination of
the request packet are the active node and the server, respectively. The above
operations are performed by evaluating the active code in each request packet.
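
The request handling just described can be sketched as follows. This is a minimal illustration, not the implemented active code: the class name, the field names, and the use of seconds for time are hypothetical, and the merge rule (a bitwise OR of the children's vectors) is an assumption standing in for the method defined earlier in the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActiveNode:
    """Per-tree state kept by one active node (hypothetical layout)."""
    interval_d: float                                  # refresh interval D, in seconds
    vector_table: dict = field(default_factory=dict)   # child -> (vector, arrival time)
    last_sent: Optional[int] = None                    # last vector forwarded upstream
    last_sent_at: float = float("-inf")                # when it was forwarded

    def on_request(self, child: str, vector: int, now: float) -> Optional[int]:
        """Record a child's request; return a vector to send upstream, or None."""
        self.vector_table[child] = (vector, now)
        merged = 0
        for child_vector, _arrival in self.vector_table.values():
            merged |= child_vector                     # ASSUMPTION: merge = bitwise OR
        changed = merged != self.last_sent
        expired = now - self.last_sent_at >= self.interval_d
        if changed or expired:
            self.last_sent, self.last_sent_at = merged, now
            return merged
        return None
```

A request that changes nothing is absorbed at the node until D has elapsed, which is what keeps the upstream request traffic independent of the number of clients below the node.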

Delivery packets are initially sent by the server to its children (i.e., active
nodes or clients). When a delivery packet that has the ith ad-category arrives,
the advertisement replacement operation is performed (if needed) for each live
child j according to the (i, j) element of the delivery table, and the delivery
packet is sent to the jth child, where the source and destination of the delivery
packet are the server and the jth child, respectively. Whether a child is alive or
not is determined by comparing the current time with the arrival time of the
last request packet from the child. These actions are performed by evaluating
the active code in the delivery packet.
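
The liveness test can be sketched as a filter over the vector table. The function name and the `timeout` parameter are hypothetical; the text says only that the current time is compared with the arrival time of the child's last request packet, so the threshold used here is an assumption.

```python
def live_children(vector_table, now, timeout):
    """Return children whose last request arrived within `timeout` seconds.

    `vector_table` maps child address -> (preference vector, arrival time).
    `timeout` is an assumed threshold; the paper sets it dynamically.
    """
    return [child
            for child, (_vector, arrival) in vector_table.items()
            if now - arrival <= timeout]


# Arrival times from Table 1, written as seconds past 20:19:00.
table = {
    "10.27.124.5": (0b01010010, 18.0),
    "10.27.124.7": (0b01001000, 15.0),
    "10.27.124.1": (0b00100000, 20.0),
}
```

With `now=22.0` and `timeout=5.0`, only the child whose last request arrived at 20:19:15 is treated as dead, so no delivery packet is copied to it.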

Note that the vector table and the delivery table as well as the multicast 
routing table are required for each multicast tree (i.e. each server). In our im- 
plementation, these tables are merged into one table for space efficiency. 

Figure 10 shows the user interface of the client program. The right panel
(control panel) is for selecting favorite ad-categories and choosing channels.
Favorite ad-categories can be changed while receiving the streaming data. The
advertisement is displayed on the upper part of the control panel. The left panel
is a viewer that displays the streaming data (MPEG2 movies). Each channel
corresponds to a multicast tree. In our experiment, we prepared several servers
as shown in Figure 11. Any of the streaming data from these servers can be
displayed in the client program by selecting the corresponding channel. Feasibility
testing confirmed that implementing the algorithms did not create a bottleneck
in terms of CPU performance.

Fig. 11. Experimental Environment.

5 Conclusion 

With IP-unicast-based multicasting, individuals can easily multicast streams, 
since the global uniqueness of each multicast group is naturally guaranteed by 
server address and port. Furthermore, it is not necessary to modify legacy nodes 
when introducing active nodes as branch points to existing networks in order to 
service IP-unicast-based multicasting. 

In this paper, we proposed a scalable and adaptable IP-unicast-based mul- 
ticast protocol. The basic idea of our protocol is to dynamically construct a 
multicast tree by sharing common links among unicast reverse-paths between 
a server and clients. The multicast communication realized by this mechanism 
still appears to be unicast for each server-client pair. This aspect releases servers 
and clients from needing any special multicast mechanism. The keep-alive tech- 
nique, a key feature of our protocol, enables the load-balancing of active nodes 
and dynamic tree reconstruction. It uses only keep-alive packets, i.e., request 
packets periodically sent to the server. This simple mechanism makes our proto- 
col scalable. The dynamic tree reconstruction so realized naturally supports the 
mobility of servers as well as clients. In order to reduce the load on the nodes,
the keep-alive mechanism does not use timers; only keep-alive packets need be
sent to active nodes. This timerless approach is realized by an algorithm that
dynamically sets the appropriate time-out at each active node.

To make our multicast more attractive, we developed an algorithm that attaches,
at active nodes, advertisements that fit the preferences of each recipient.
The algorithm determines which active nodes should attach which advertise- 
ments in order to minimize the total attachment cost over each multicast tree. 
The advertisement attachment algorithm is so general and effective that it can be 
applied to other services such as transforming encoding formats (transcoding). 

We implemented our multicast mechanism with the advertisement attach- 
ment function within our active network environment. Our work so far has 
demonstrated the validity of the idea. In the future, we plan to quantitatively 
evaluate the mechanisms’ performance. 



References 

[1] H. Akamine, N. Wakamiya, M. Murata, and H. Miyahara. An approach for hetero- 
geneous video multicast using active networking. In Proceedings of IWAN 2000. 
IFIP, 2000. 

[2] B. Duysburgh, T. Lambrecht, B. Dhoedt, and P. Demeester. Data transcoding in
multicast sessions in active networks. In Proceedings of IWAN 2000. IFIP, 2000.

[3] H.W. Holbrook and D.R. Cheriton. IP multicast channels: EXPRESS support for
large-scale single-source application. In Proceedings of SIGCOMM, 1999.

[4] L.H. Lehman, S.J. Garland, and D.L. Tennenhouse. Active reliable multicast. In
Proceedings of INFOCOM’98. IEEE, 1998.

[5] K. Psounis. Active networks: Applications, security, safety, and architectures.
IEEE Communications Surveys, pages 2-16, 1999.

[6] RFC1075. Distance Vector Multicast Routing Protocol. IETF Home Page:
http://www.ietf.org

[7] RFC1584. Multicast Extensions of OSPF. IETF Home Page:
http://www.ietf.org

[8] RFC1889. RTP: A Transport Protocol for Real-Time Applications. IETF Home
Page: http://www.ietf.org

[9] RFC2002. IP Mobility Support. IETF Home Page: http://www.ietf.org

[10] RFC2117. Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol
Specification. IETF Home Page: http://www.ietf.org

[11] RFC2189. Core Based Trees (CBT version 2) Multicast Routing. IETF Home
Page: http://www.ietf.org

[12] I. Stoica, T.S. Eugene, and H. Zhang. Reunite: A recursive unicast approach to
multicast. In Proceedings of INFOCOM 2000. IEEE, 2000.

[13] S. Wen, J. Griffioen, and K.L. Calvert. Building multicast services from unicast
forwarding and ephemeral state. In Proceedings of OPENARCH 2001. IEEE,
2001.



Compiling PLAN to SNAP* 



Michael Hicks¹, Jonathan T. Moore², and Scott Nettles³

¹ Computer Science Department, Cornell University
mhicks@cs.cornell.edu

² Computer and Information Science Department
University of Pennsylvania
jonm@dsl.cis.upenn.edu

³ Electrical and Computer Engineering Department
The University of Texas at Austin
nettles@ece.utexas.edu



Abstract. PLAN (Packet Language for Active Networks) is a highly
flexible and usable active packet language, whereas SNAP (Safe and
Nimble Active Packets) offers significant resource usage safety and
achieves much higher performance compared to PLAN, but at the cost 
of flexibility and usability. Ideally, we would like to combine the good 
properties of PLAN with those of SNAP. We have achieved this end by 
developing a compiler that translates PLAN into SNAP. The compiler 
allows us to achieve the flexibility and usability of PLAN, but with the 
safety and efficiency of SNAP. In this paper, we describe both languages, 
highlighting the features that require special compilation techniques. We 
then present the details of our compiler and experimental results to eval- 
uate our compiler with respect to code size. 



1 Introduction 

One of the most aggressive approaches to active networking involves the use of
active packets (or capsules): packets where the traditional passive header is
augmented or replaced with a program. This program is executed as it traverses
the network, thus offering per-packet customizability for applications. Indeed,
this flexibility has been used to implement a variety of applications, such as
application-specific routing, transparent redirection of web requests to nearby
caches, distributed on-line auctions, reliable multicast, mobile code
firewalls, and reduced network management traffic.

Existing active packet systems differ in how they trade off four design
criteria: flexibility, safety, usability, and performance. In this paper, we consider
two active packet systems: PLAN (Packet Language for Active Networks)
and SNAP (Safe and Nimble Active Packets). PLAN’s design places an
emphasis on flexibility and usability. In particular, PLAN is flexible enough to be
the centerpiece of an active internetwork, called PLANet, in which every packet
contains a PLAN program. Furthermore, because PLAN is a high-level language,
it is fairly usable. Indeed, the PLANet distribution has been downloaded by over
450 users and is actively being used for research. SNAP, on the other
hand, is a bytecode-style language designed with an emphasis on safety and
performance. In particular, SNAP sacrifices some flexibility (relative to PLAN) to
provide resource bounding guarantees, and sacrifices some usability, since it is a
low-level language, to aid performance.

* This work was supported by the NSF under contracts ANI 00-82386 and ANI 98-13875.

The goal of this work is to show that we can have the best of both worlds by 
compiling PLAN into SNAP. Because the two languages differ in their models 
of active packet execution and indeed in their basic programmatic expressibility, 
the compilation process is not straightforward. Nonetheless, our compiler allows 
us to achieve the flexibility and usability of PLAN while attaining the safety and 
performance of SNAP. 

We begin by introducing PLAN and SNAP in Sections 2 and 3, respectively. In
particular, we will highlight the differences in programming model and
expressibility that we will have to overcome during translation. In Section 4 we then
present compilation techniques that allow us to overcome these problems, and
then experimentally evaluate our compiler. Finally, we describe future
and related work and conclude.



2 PLAN 

PLAN is a strongly-typed functional language with syntax similar to
Standard ML. PLAN is part of a two-level active networking architecture;
namely, active packets provide the control logic and “glue” for tying together
and controlling node-resident services. In this sense, PLAN programs/packets 
are similar to Unix shell scripts that provide control over utility functions like 
sort and grep. 

PLAN supports standard programming features, such as functions and arith- 
metic, and features common to functional programming, like lists and the list 
iterator fold (intuitively, fold executes a given function f for each element of a
given list, accumulating a result as it goes). A notable restriction is that functions 
may not be recursive and there is no unbounded looping; this helps guarantee 
that all PLAN programs terminate. 

To support packet transmission, PLAN includes primitives for remote eval- 
uation: a user can specify that a computation (function call) take place on a 
different node. The two main primitives are OnNeighbor and OnRemote. Both 
take arguments that describe which function to call, a set of actual arguments, 
and a node on which to evaluate the call (the evalDest). For OnNeighbor, the
evalDest must be one hop away from the current node, whereas OnRemote also 
accepts a routFun argument to specify a particular routing function, which de- 
termines the routing on nodes leading to the ultimate evalDest. Both primitives 
must also be supplied with a resource bound, which acts as a TTL or hop count. 
The resource bound is decremented on each hop, and during execution a program
must “donate” some of its resource bound to the OnRemote or OnNeighbor used
to send a new packet.

    fun ping(payload:blob, port:int) =
      OnRemote(|reply|(payload,port), getSrc(), getRB(), defaultRoute)

    fun reply(payload:blob, port:int) = deliverUDP(payload, port)

Fig. 1. PLAN Ping.

PLAN provides the ability to manipulate programs as data, via a construct 
known as a chunk (short for ‘code hunk’). Chunks provide the means for PLAN 
packets to be fragmented or encapsulated within one another. A chunk has three 
logical elements: some PLAN code, an entry function name, and actual argu- 
ments. Evaluating a chunk results in looking up the named entry function and 
calling it with the arguments. Note that OnNeighbor and OnRemote actually 
transmit chunks. Programming with chunks is considered in detail in prior work on PLAN.



2.1 Example: PLAN Ping 

Figure 1 shows how to program ping in PLAN. Initially we start with a packet
whose evalDest is set to our ping target, an entry point function ping, and actual 
arguments payload and port, which are a payload and an application UDP port 
number, respectively. 

When the packet reaches the destination, we evaluate the call to ping, which 
in turn creates a chunk |reply|(payload, port) which will invoke reply with
our payload and port number. We determine our original source via the getSrc 
service, and then use OnRemote to cause our new chunk to be evaluated on that 
node. This spawns a new packet that makes the return trip to our source using 
defaultRoute to determine routing at intermediate nodes. The call to getRB 
returns all of the current packet’s resource bound and donates it to the new 
packet. 

Finally, when the return packet arrives at the source, we evaluate its chunk 
and thereby call reply, which simply delivers our payload to the application 
waiting on the appropriate UDP port. 



2.2 Advantages and Disadvantages 

PLAN is flexible and easy to use. Despite the limits on its computational power, 
it is a high-level language both in its syntax and its features. In particular, 
the combination of chunks with remote evaluation provides a powerful set of
abstractions that are tailored for packet programming. These powerful features, 
along with support for general language constructs such as function definitions 
and calls, are a key part of its flexibility.

Unfortunately, PLAN has some important disadvantages. First, transmitting
and receiving PLAN requires marshalling and unmarshalling some
representation of PLAN programs. This significantly increases the cost of PLAN
packets compared to more conventional approaches. Second, although PLAN
programs are guaranteed to terminate, this guarantee is rather weak. In
particular, it is possible for PLAN programs to execute in time and space that is
exponential in their length (we will discuss this in more depth later).
For large packets, this may be essentially no different than unbounded execution.



3 SNAP 

SNAP is a second-generation active packet system that was designed,
in part, to address other active packet systems’ (including PLAN) weaknesses
in performance and resource safety. The main thrust of SNAP’s design was to 
restrict the flexibility of the packet programming language in return for improved 
safety and a more streamlined and efficient implementation. 

The key safety gain of SNAP over PLAN comes from its model of resource 
usage: SNAP programs use time, space, and bandwidth linearly in the length of 
the program. The SNAP implementation achieves this by requiring all branches 
to go forward, thus preventing looping and causing the number of instructions 
executed to be limited by the program’s length. This, when combined with the 
fact that SNAP instructions execute in constant time, means that all SNAP pro- 
grams run in time linear in their length. Similar restrictions on each instruction’s 
memory and bandwidth use achieve the other bounds. 
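
The linear-time argument can be illustrated with a toy interpreter. This is a sketch under stated assumptions, not SNAP's implementation: the `run` function, the `(opcode, operand)` encoding, and the reduced opcode set (push, pop, and the branches jmp, bne, beq that SNAP provides) are chosen for illustration, a branch target is taken to be pc + offset, conditional branches are assumed to consume the tested value, and the sketch rejects an offset of 0 so that every step advances.

```python
def run(program, stack):
    """Run a list of (opcode, operand) pairs; return (stack, steps executed)."""
    stack = list(stack)          # end of the list is the top of the stack
    pc = steps = 0
    while pc < len(program):
        op, arg = program[pc]
        steps += 1
        offset = 1               # default: fall through to the next instruction
        if op == "push":
            stack.append(arg)
        elif op == "pop":
            stack.pop()
        elif op in ("jmp", "bne", "beq"):
            if arg < 1:          # sketch-only rule: no backward or zero offsets
                raise ValueError(f"non-forward branch at {pc}: {op} {arg}")
            if op == "jmp":
                taken = True
            else:
                top = stack.pop()            # ASSUMPTION: the test value is consumed
                taken = (top != 0) if op == "bne" else (top == 0)
            if taken:
                offset = arg
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += offset
    return stack, steps


program = [("bne", 2), ("push", 7), ("push", 8)]
```

Because `pc` strictly increases on every step, the loop body runs at most `len(program)` times, whichever branches are taken.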

To enable efficient execution, SNAP was designed as a stack-based bytecode 
language, providing instructions for performing simple arithmetic, environment 
query, control flow, and packet sends. A full description of the language can be
found elsewhere, but we will highlight here the features that impact our use of
SNAP as a compilation target for PLAN.

Because SNAP is a stack-based language, all computation occurs by removing arguments
from the stack, performing the computation, and then pushing the result back
on the stack. SNAP includes three main stack manipulation primitives so that
programs can properly order their arguments on the stack: push v pushes the
value v on top of the stack, pop removes the top stack item, and pull n pushes
a copy of the nth stack element onto the stack.

Aside from stack storage, SNAP also includes the notion of a heap, in which 
we can store variable-sized data (such as binary payloads) or data not accessed 
in a stack-based discipline. The two main primitives for heap manipulation in 
SNAP are mktup n, which creates a new heap-allocated tuple¹ out of the top n
stack values, and nth n, which extracts the nth element of the tuple pointed to 
by the top stack value. 
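
The stack and heap primitives can be modelled in a few lines. The class below is an illustrative sketch rather than SNAP's data layout: the name `SnapStack` is hypothetical, Python tuples stand in for heap-allocated tuples, and two indexing conventions are assumptions (pull counts from 0 at the top of the stack, nth counts tuple elements from 1).

```python
class SnapStack:
    """Toy model of the SNAP stack and heap primitives described above."""

    def __init__(self, initial=()):
        self.items = list(initial)      # end of the list is the top of the stack

    def push(self, value):
        self.items.append(value)

    def pop(self):
        return self.items.pop()

    def pull(self, n):
        # ASSUMPTION: n = 0 names the top element.
        self.items.append(self.items[-1 - n])

    def mktup(self, n):
        # Replace the top n values with one tuple (top element last).
        start = len(self.items) - n
        values = tuple(self.items[start:])
        del self.items[start:]
        self.items.append(values)

    def nth(self, n):
        # ASSUMPTION: tuple elements are numbered from 1.
        tup = self.items.pop()
        self.items.append(tup[n - 1])
```

For example, pushing 10, 20, and 30 and then calling `pull(2)` leaves a copy of 10 on top without disturbing the three values below it.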

There are also a variety of program control flow instructions, such as jmp
(“jump”), bne (“branch if not equal to zero”), beq (“branch if equal to zero”),
and paj (“pop and jump”). Each of these instructions takes a constant immediate
argument² that is a non-negative relative branch offset. In the case of paj, this
offset is added to the top stack value (again, the resulting offset must be
non-negative).

¹ Tuples can be thought of as a small array of values, similar to a struct in C.

Finally, SNAP contains a variety of packet sending operations. In addition to 
a more general send instruction for doing arbitrary packet sends, SNAP includes 
a forw instruction which compares the current node to a packet’s destination 
address and, if the packet has not yet reached its destination, forwards the packet 
towards its ultimate destination. 



3.1 Example: Ping 



    forw        ; move on if not at dest
    bne 5       ; jump 5 instrs if nonzero on top
    push 1      ; 1 means “on return trip”
    getsrc      ; get source field
    forwto      ; send return packet
    pop         ; pop the 1 for local ping
    demux       ; deliver payload



Fig. 2. SNAP Code for Ping. 



Figure 2 shows a ping program written in SNAP assembly language. Assume
we send a packet containing this program and a stack of three stack values: 
[0 :: port :: payload]. The 0 will be used as a flag to indicate whether we are in 
the outgoing direction or on the return trip back to the source. 

The packet will first forward itself to the destination by executing the forw 
instruction on each intervening node. Upon reaching the destination, the forw 
instruction falls through to the bne (“branch if not equal to zero”) instruction. 
Since there is a 0 on top of the stack, the branch falls through as well. We then 
push a 1 onto the stack to indicate we are ready to return. 

Next, the program retrieves the packet’s source address via the getsrc in- 
struction and then sends itself back towards the source with the forwto instruc- 
tion, thus resetting the packet’s destination address to be the original source 
address. This packet will then forward itself all the way back to the source with 
forw. Upon reaching the source, the forw will fall through. Since there is a 1 
on top of the stack, the bne will branch forward 5 instructions to the demux 
instruction, which delivers our payload to the given port³.

² Our SNAP assembly language allows constant expressions involving code label
positions or the location of the current instruction (pc, or program counter).

³ The pop instruction is included to allow a node to ping itself (no packet sends result
in this case).

3.2 Advantages and Disadvantages

Essentially by design, SNAP’s advantages and disadvantages are the converse of
PLAN’s. SNAP’s model of resource usage implies tight control over resources and
SNAP retains the advantages of strong typing. The low-level bytecode nature
of SNAP allows efficient execution, frequently without marshalling or
unmarshalling costs. Unfortunately, SNAP is much harder to use compared to PLAN.
In particular, the severe restrictions on flexibility, namely, that branches may 
only go forward, make common programming idioms impossible. For example, 
function calls are impossible to implement straightforwardly. Furthermore, it is 
simply more difficult and tedious to program in a low-level, assembly-like lan- 
guage. 

4 Compilation 

We have seen that PLAN and SNAP have complementary advantages and disad- 
vantages. In fact, what we would really like is an active packet language that is 
flexible, easy to use, safe (in particular in terms of resource usage), and efficient. 
We can achieve this blend by noting that the advantages of PLAN over SNAP 
are mostly notational: PLAN programs are high-level and easy to create, and 
SNAP’s are not. Conversely, SNAP shines in its execution characteristics: it is 
efficient and can only succinctly express computations that can be executed with 
limited resources. These observations suggest the strategy of compiling PLAN 
to SNAP to gain the best of both. In this section, we present the key elements 
of a PLAN to SNAP compiler that we have written. 

PLAN and SNAP have related computational models, in large part because 
SNAP’s design was informed by experience with PLAN. However, SNAP
departs from PLAN’s semantics in a number of non-trivial ways that complicate
the process of compiling PLAN programs to SNAP bytecodes. The main
difficulty arises due to SNAP’s lack of backward branches. We first consider the
compilation of PLAN features that would normally require backward branches,
and how we compile these features without them. We then consider the mapping
between PLAN and SNAP data structures, and finally consider some of PLAN’s
advanced features.

4.1 Backward Branches 

To straightforwardly compile function calls and loops, both supported in PLAN, 
would require backward branches. In this subsection, we explain how to compile 
these language features without backward branches, and then consider a novel 
way to simulate backward branches by resending a packet to the current router 
at an earlier evaluation point. 



Function Calls. For stack-based architectures, a call to a function f is compiled
as a jump to f’s address, having first pushed the return address, which is typically
the address of the instruction following the jump. Upon its completion, the
function f pops the return address and jumps to it. Unfortunately, since one
of these two jump instructions must go backward (either to f initially, or in
returning to the caller), this approach is prevented by SNAP’s semantics. For
example, consider the following compilation from PLAN to SNAP:

    fun f() = print(1)
    fun g() = f()

    f:    push 1      ; push arg to print
          print       ; print it
          paj 0-pc    ; return to caller
    g:    push lab1   ; push return address
          jmp f-pc    ; call f
    lab1: paj 0-pc    ; return to caller

This compilation will fail at runtime because the jmp f-pc instruction will com- 
pile to jmp -4, which is not allowed. To address this problem, we have explored 
two compilation strategies for compiling function calls without using backward 
branches: reordering basic blocks and inlining. 
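
The offset arithmetic behind this failure can be checked mechanically. The sketch below is a hypothetical label resolver, not the authors' assembler: it handles only jmp, treats every other operand as opaque, and computes a jump offset as the target label's position minus the position of the jmp itself, which is the convention implied by f-pc evaluating to -4.

```python
def resolve(lines):
    """Turn (label, opcode, operand) rows into (opcode, operand) pairs.

    For jmp, the operand is a target label, replaced by target - pc.
    """
    labels = {label: pc for pc, (label, _op, _arg) in enumerate(lines) if label}
    resolved = []
    for pc, (_label, op, arg) in enumerate(lines):
        if op == "jmp":
            offset = labels[arg] - pc
            if offset < 0:
                raise ValueError(f"jmp at {pc} to {arg!r} needs offset {offset}")
            resolved.append((op, offset))
        else:
            resolved.append((op, arg))
    return resolved


naive = [
    ("f",    "push",  1),
    (None,   "print", None),
    (None,   "paj",   "0-pc"),   # return sequence, left symbolic in this sketch
    ("g",    "push",  "lab1"),
    (None,   "jmp",   "f"),
    ("lab1", "paj",   "0-pc"),
]

# The same blocks with f placed between the two halves of g.
reordered = [
    ("g",    "push",  "lab1"),
    (None,   "jmp",   "f"),
    ("f",    "push",  1),
    (None,   "print", None),
    (None,   "paj",   "0-pc"),
    ("lab1", "paj",   "0-pc"),
]
```

Resolving `naive` raises an error reporting offset -4, while `reordered` resolves with a forward offset of 1 for the same jump.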

Approach 1: Reordering Basic Blocks. We first considered compiling function 
calls as usual, but then sorted the resulting basic blocks so that all branches go 
forward. For example, we can compile the example program above as before, but 
then sort the basic blocks topologically to prevent backward branches: 

    g:    push lab1   ; push return address
          jmp f-pc    ; call f
    f:    push 1      ; push arg to print
          print       ; print it
          paj 0-pc    ; return to caller
    lab1: paj 0-pc    ; return to caller

We have reordered the function g so that its basic blocks are split by the body 
of f . Unfortunately, this approach will not work in general. Consider modifying 
g to call f twice: 

    fun f() = print(1)
    fun g() = (f(); f())

    f:    push 1      ; push arg to print
          print       ; print it
          paj 0-pc    ; return to caller
    g:    push lab1   ; push return address
          jmp f-pc    ; call f
    lab1: push lab2   ; push return address
          jmp f-pc    ; call f
    lab2: paj 0-pc    ; return to caller

It is easy to see that g’s blocks cannot be rearranged such that all branches go 
forward. In particular, the block labl must follow f since it is returned to by f , 
but it must also precede f since it will jump to f . 
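
The ordering constraint can be phrased as a topological sort over basic blocks, where an edge from one block to another means control transfers from the first to the second. The sketch below uses Python's standard `graphlib` module; the function name and the edge lists are illustrative, with block names taken from the listings above.

```python
from graphlib import CycleError, TopologicalSorter


def order_blocks(edges):
    """Return a block order in which every (src, dst) edge goes forward, or None."""
    sorter = TopologicalSorter()
    for src, dst in edges:
        sorter.add(dst, src)          # dst must be placed after src
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


# g calls f once: g jumps to f, and f returns to lab1.
one_call = [("g", "f"), ("f", "lab1")]

# g calls f twice: lab1 also jumps to f, and f also returns to lab2.
two_calls = [("g", "f"), ("f", "lab1"), ("lab1", "f"), ("f", "lab2")]
```

The single-call program sorts to g, f, lab1. The two-call program has the cycle between f and lab1, so no forward-only order exists and the function returns `None`.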







Approach 2: Inlining. One way to more generally handle function calls is to
eliminate them altogether by inlining. That is, at each call site, we replace a 
given function call with a copy of its body, with the formal parameters replaced 
by the actual ones. In general-purpose languages, inlining cannot, in general, 
remove all calls to recursive functions, since the function’s body contains a call 
to the function itself. However, PLAN does not permit recursive function calls, 
so all function calls can be removed via inlining. 
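
A minimal sketch of inlining on a toy program representation follows. The representation is hypothetical: each function body is a list of `("prim", text)` or `("call", name)` items, and parameter substitution is omitted because the example functions take no arguments. The recursion in `inline` terminates only because the program is assumed to be non-recursive, as PLAN guarantees.

```python
def inline(functions, name):
    """Expand every call in `name`'s body; assumes no recursion, as in PLAN."""
    body = []
    for kind, value in functions[name]:
        if kind == "call":
            body.extend(inline(functions, value))   # copy the callee's body here
        else:
            body.append((kind, value))
    return body


functions = {
    "f": [("prim", "print(1)")],
    "g": [("call", "f"), ("call", "f")],
}
```

Inlining `g` yields two copies of `print(1)` and no remaining calls, which removes the need for a return jump but duplicates the body of `f` at each call site.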



While inlining solves the problem of backward branches, it can result in code 
bloat, which is particularly worrisome because code occupies space in a network- 
bound packet. In our experience, PLAN programs are often written as one or 
more general functions followed by an ‘entry-point’ function that provides a sort 
of ‘user interface.’ For example, the ‘scout’ packet defines a
function dfs that performs a depth-first traversal of the network, with an
entry-point function startDFS to begin the search. Using inlining, the entire contents
of dfs will be inlined, and then dfs will itself be inlined inside of startDFS, 
effectively doubling the size of the packet. 



We can reduce code bloat in a number of ways. First, we could ‘prune’ a 
packet to eliminate extra code certain to be unneeded by future computations. 
For example, the startDFS function is only invoked by the sender of the packet; 
therefore, when child packets are transmitted from the sending node, the code for 
startDFS can be eliminated. The PLANet implementation performs pruning
of this kind. However, while performing a run-time analysis to find dead code
is reasonably straightforward in PLAN, it is less so in SNAP, and may result in 
excessive computation. 



Second, we could combine inlining with block reordering, using topological 
sort to split basic blocks when possible, and using inlining otherwise. In general, 
within a given function g, we inline all but one of the calls to some function f , 
made either from g or from some function called by g. The reason for this can be 
seen by looking at a control-flow graph between the basic blocks of a program. 
Figure 3 shows the control-flow graphs of two PLAN programs. The basic blocks
of each PLAN function are split by function calls; for example, g’s block that 
precedes a call to f is labeled g-fpre, while the block following it is g-fpost 
(or g-fpre2 in the case that another call to f is made). We can see in the figure 
that each of these graphs has a loop, implying a backward branch is needed. 
Furthermore, the backward branch arises due to a second call to f , either from 
g in the case of the first program, or from g’s child c in the second program. 
There is no way to reorder either program to allow branches to go forward. 



To eliminate the backward branch, the second call to f is inlined. For the 
first program we have (f is inlined as print(1) in the PLAN program):

    fun f() = print(1)
    fun g() = (f(); print(1))

    g:    push lab1   ; push return address
          jmp f-pc    ; call f
    f:    push 1      ; push arg to print
          print       ; print it
          paj 0-pc    ; return to caller
    lab1: push 1      ; push arg to print
          print       ; print it

The two PLAN programs whose control-flow graphs appear in Figure 3 are:

    fun f() = print(1)
    fun g() = (f(); f())

    fun f() = print(1)
    fun c() = f()
    fun g() = (f(); c())

Fig. 3. Control-Flow Graph of the Basic Blocks of Two PLAN Programs.

In our current implementation, we implement only the basic inlining
approach, but we plan to alter the compiler to employ the hybrid approach
described here. As we will see in our evaluation, this can result in large code size savings.



Finite Loops. Loops require a backward branch to the loop head, and therefore 
also require more clever compilation. In PLAN, the only looping construct is the 
list iterator fold. Intuitively, fold executes a given function $f$ for each element
of a given list $[b_1; b_2; \ldots; b_n]$. The arguments to $f$ include the list element $b_i$ and
the current accumulator value $a$. The value returned by $f$ is supplied by fold
as the accumulator to the next iteration. Therefore, $fold(f, a, [b_1; b_2; \ldots; b_n])$ is
equivalent to $f(f(\ldots f(f(a, b_1), b_2) \ldots), b_n)$. In the PLAN implementation, fold as
explained above is called foldl; there is also a version called foldr, such that
$foldr(f, [a_1; a_2; \ldots; a_n], b)$ is $f(a_1, f(a_2, (\ldots f(a_n, b) \ldots)))$.
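
The two fold variants can be written in a few lines of Python for comparison. This is an illustration of the semantics only, using `functools.reduce`; the argument orders (function, accumulator, list for foldl and function, list, accumulator for foldr) are assumed here for the sake of the sketch.

```python
from functools import reduce


def foldl(f, acc, items):
    """Compute f(...f(f(acc, b1), b2)..., bn)."""
    return reduce(f, items, acc)


def foldr(f, items, acc):
    """Compute f(a1, f(a2, ...f(an, acc)...))."""
    return reduce(lambda result, item: f(item, result), reversed(items), acc)


def sub(i, j):
    return i - j
```

With subtraction the two orders differ: `foldl(sub, 0, [1, 2, 3])` is ((0 - 1) - 2) - 3 = -6, whereas `foldr(sub, [1, 2, 3], 0)` is 1 - (2 - (3 - 0)) = 2.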

We can eliminate the presence of fold from PLAN programs by unrolling its 
computation, revealing the calls to the iterator function. However, the number 
of times that the iterator is called depends on the length of the list, which in
general cannot be known until runtime. Therefore, the compiler unrolls the loop
a fixed number of times, as specified by the user. For example:



    fun foo(l) =
      let fun sub(i,j) = (i-j)
          val a = foldl(sub,0,l)
      in
        print(a)
      end

is unrolled to

    fun foo(l) =
      let
        fun sub(i,j) = (i - j)
        val a = let
                  val _list = l
                  val _acc = 0 in
                  if (_list <> []) then
                    let
                      val _acc = sub(_acc, (hd _list))
                      val _list = (tl _list) in
                      if (_list <> []) then
                        ...
                      else _acc
                    end else _acc
                end
      in print(a)
      end



The result of unrolling the fold is that calls to the iterator function sub have 
been made explicit. These function calls can then be eliminated by the techniques 
described above. 

There are two disadvantages to unrolling folds. For purposes of discussion, 
assume that u is the number of times the loop is unrolled, and n is the length 
of the list, known at runtime. The first problem is that when u > n, the packet 
program occupies more space than is needed, thus unnecessarily reducing the 
amount of useful payload that can be carried. The second, more serious problem 
occurs when u < n, meaning that the loop was not unrolled enough times to 
process the entire list, potentially leading to incorrect behavior. 

There are a number of ways to deal with the second problem, when u < n.
First, we could ignore the fact that we did not complete the loop processing. 
This approach makes sense when incomplete list processing leads to degraded, 
but not incorrect service. Alternatively, we could signal an error by throwing an 
exception if we complete processing without reaching the end of the list. This 
exception will either exit the packet scope and halt the packet (see Section 4.3
below), or it could be handled within the packet by wrapping the fold with a
try ... handle block.
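The two policies for the case u < n can be sketched as follows. This is a Python model of the unrolled loop, not compiler output; the exception name IncompleteFold and the strict flag are hypothetical.

```python
class IncompleteFold(Exception):
    """Hypothetical name: raised when u unrollings did not exhaust the list."""


def unrolled_foldl(f, a, lst, u, strict=False):
    """Apply f to at most the first u elements of lst, as u unrolled copies would."""
    acc, rest = a, list(lst)
    for _ in range(u):                          # one pass per unrolled copy
        if not rest:                            # the `_list <> []` test fails
            return acc
        acc, rest = f(acc, rest[0]), rest[1:]   # sub(_acc, hd _list); tl _list
    if rest and strict:                         # u < n: list not fully processed
        raise IncompleteFold(f"{len(rest)} element(s) left unprocessed")
    return acc                                  # non-strict: ignore the remainder


def sub(i, j):
    return i - j


print(unrolled_foldl(sub, 0, [1, 2, 3], 5))     # full result: u >= n
print(unrolled_foldl(sub, 0, [1, 2, 3], 2))     # degraded result: u < n
```

Passing strict=True corresponds to signalling an error instead of silently returning a partial result.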

Typical PLAN programs use loops only on short lists of addresses or devices,
so we expect to be able to choose u >= n in most cases. We look at the effect of
unrolling on code space in Section 5.



Simulating Backward Branches. Though not possible with straightforward 
computation, a SNAP program can effectively branch backwards by sending 
a packet to the current router with an earlier entry point, at the cost of one 
resource bound unit. At the SNAP level, a backward branch can be encoded as 
follows: 
g:  ...           ; some instructions
    jmp g-pc      ; backwards branch

becomes

g:  ...           ; some instructions
    here          ; local address
    getrb         ; current rb
    push -1       ; reuse entire stack
    push g        ; entry point
    send          ; branch backwards
    exit          ; kill current packet



Essentially, a backward branch is a send instruction, having the following argu- 
ments: the local host address, the total current resource bound, the total current 
stack, and the address of desired program location. We need to follow the send 
with an exit instruction, since send creates a new packet, while the current 
packet continues to execute. 

Selective use of backward branches can alleviate both code bloat and
incomplete list processing. For example, we could choose not to inline
calls to large functions, relying on backward branches instead. Similarly, we could 
compile fold to branch backwards after u unrollings of the iterator function if list 
processing is not yet complete. We hope to experiment with employing backward 
branches in the compiler, though we have not yet done so. With selective use 
of backward branches we can trade resource bound for smaller packet sizes and 
more straightforward semantics. 
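A rough Python model of this trade-off is given below. It assumes that each simulated backward branch (a send to the local host) costs exactly one unit of resource bound and fails when none is left; the function name and the use of RuntimeError are illustrative choices, not part of SNAP.

```python
def fold_with_backward_branches(f, a, lst, u, rb):
    """Fold over lst with u unrolled iterations per pass.

    After u iterations, if the list is not exhausted, the packet is
    're-sent' to the same node, which costs one unit of resource bound.
    Returns (accumulator, remaining resource bound).
    """
    acc, rest = a, list(lst)
    while True:
        for _ in range(u):                       # the u unrolled iterations
            if not rest:
                return acc, rb
            acc, rest = f(acc, rest[0]), rest[1:]
        if not rest:
            return acc, rb
        if rb == 0:                              # send fails without resource bound
            raise RuntimeError("resource bound exhausted before end of list")
        rb -= 1                                  # one backward branch


total, left = fold_with_backward_branches(lambda x, y: x + y, 0, range(12), 5, 10)
print(total, left)
```

Here twelve elements with u = 5 need three passes and therefore two backward branches, so two units of resource bound are spent.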



4.2 Data Structures

In addition to primitive data types like integers, characters, and floats, PLAN 
supports arbitrary-length, homogeneous lists and fixed-length, heterogeneous 
tuples. SNAP supports similar primitive types, as well as arbitrarily-sized blocks 
of pointers which are essentially untyped tuples. Tuples compile one-to-one from 
PLAN to SNAP, and list elements in PLAN map to pairs, in the style of Scheme, 
where the first element contains the data, and the second element contains a 
pointer to the next element, or 0 to terminate the list. 
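The list encoding can be pictured with a small Python sketch, using Python tuples to stand in for SNAP pairs and the integer 0 as the terminator, as described above. The helper names are hypothetical.

```python
NIL = 0  # list terminator, as described in the text


def to_pairs(items):
    """Encode a Python list as nested (data, next) pairs ending in 0."""
    cell = NIL
    for x in reversed(items):
        cell = (x, cell)
    return cell


def from_pairs(cell):
    """Decode nested (data, next) pairs back into a Python list."""
    out = []
    while cell != NIL:
        data, cell = cell
        out.append(data)
    return out


print(to_pairs([1, 2, 3]))
print(from_pairs((1, (2, (3, 0)))))
```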



4.3 Advanced Features 

PLAN supports a number of advanced language features, including exceptions, 
per-packet routing functions, and chunks. In this section, we explain how we 
compile these features. 

Exceptions. PLAN supports exceptions in the style of ML (whose exceptions 
are similar to those in Java). As a small example, consider the following PLAN 
code: 

try
  if x = 1 then raise Exit
  else print(x)
handle Exit =>
  print("caught Exit")
The code raises an exception Exit when the variable x is equal to 1; this
exception is caught by the surrounding handle block, which prints a message. If
not predefined by some service, Exit must be declared by the user earlier in the 
program. Exceptions can also be raised by services and network primitives, like 
OnRemote. 

In PLAN, exceptions are treated as strings. For example, when the han- 
dler above catches an exception, it performs a string compare on the excep- 
tion’s name. String compares are not permitted in SNAP, because they are 
non-constant-time operations, so exceptions are essentially integers. As a result, 
we map PLAN exceptions to SNAP integers during compilation. As in PLAN, 
SNAP exceptions can be raised by services and network primitives, like send, in 
that when something goes wrong, an exception is returned rather than a value 
of the expected type. For example, if send fails due to lack of resource bound, 
it will return an exception. Because service exceptions must have global identity 
(and therefore must always map to the same integer), we store the mapping in 
a configuration file used by the compiler. 
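One way to picture the mapping is the Python sketch below. The service exception names and their codes are invented for illustration; the text only states that such a table lives in a configuration file read by the compiler.

```python
# Hypothetical contents of the compiler's configuration file:
# service exceptions have fixed, globally agreed integer codes.
SERVICE_EXCEPTIONS = {"NoResourceBound": 1, "BadIndex": 2}


def assign_exception_codes(user_exceptions, service_table=SERVICE_EXCEPTIONS):
    """Map every exception name to a distinct integer.

    Service exceptions keep their configured codes; user-declared
    exceptions receive fresh codes above the largest configured one.
    """
    codes = dict(service_table)
    next_code = max(codes.values(), default=0) + 1
    for name in user_exceptions:
        if name not in codes:
            codes[name] = next_code
            next_code += 1
    return codes


print(assign_exception_codes(["Exit", "Retry", "Exit"]))
```

After this step a handler only needs an integer comparison, which is a constant-time operation.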

We can compile exceptions to SNAP in a straightforward manner for two 
reasons: (1) when using inlining, handlers always occur later than where an 
exception is raised (so raising an exception will not result in a backward branch), 
and (2) the handler for a (potentially) raised exception is known at compile time. 
In most languages, separate compilation does not permit handlers to be known
at compile time, but our compiler essentially translates whole programs.

We compile exception-raising as follows. First, we observe that exceptions 
are raised in two ways: by the raise command, and by services like hd; each of 
these is handled differently. For the expression raise e, we move the exception 
e into a well-known stack location, and then jump to the closest handler (whose 
address is known at compile time); if no handler exists, we halt the program.^
For services or primitives that may return exceptions, we check the returned 
value of that service using the ISX instruction to see if it is an exception, and if 
so, jump to a handler. Handle-blocks are compiled to look for the exception in 
the well-known stack location. If the exception does not match the one in the 
handle statement, the code jumps to the surrounding handler. 
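The handler chain can be modelled in a few lines of Python. This is a behavioural sketch only: in the compiled SNAP code the chain is a sequence of forward jumps, and the list-of-pairs representation here is an assumption made for illustration.

```python
def dispatch(raised, handlers):
    """Find the handler that catches the integer exception code `raised`.

    `handlers` lists (exception_code, handler_label) pairs from the
    innermost enclosing handler outwards. A handler that does not match
    passes control to the surrounding one; if none matches, the program
    halts, modelled here by returning None.
    """
    for code, label in handlers:
        if code == raised:          # constant-time integer comparison
            return label
    return None


chain = [(3, "inner_handler"), (4, "outer_handler")]
print(dispatch(4, chain))
```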



Chunks. Translating chunks to SNAP is straightforward: a chunk becomes a 
SNAP 2-tuple consisting of an entry offset (relative to the top of the code block) 
and another tuple containing the initial stack; the code part is elided (it is shared 
with the current packet’s code). The entry offset is not the address of the code 
block itself, but instead some “unmarshalling” code that precedes that block. 
This piece of code expects the tuple containing the arguments to be on top of 
the stack, so that it can extract the arguments into the stack positions expected 
by the actual code block. For example, the following piece of code defines a 
function f, and a function make_f_chunk that makes a chunk out of f:

^ In PLAN, an uncaught exception is turned into a packet that is sent to the sender. 
Adding this feature to SNAP is future work. 
fun f(i, j) =
  print(i + j)
fun make_f_chunk() =
  |f|(1, 2)

f_chunk:       pull 0          ; copy args tuple
               nth 2           ; get the second arg
               pull 1          ; copy the tuple
               nth 1           ; get the first arg
               store 2         ; store in first slot
f:             add             ; add the two args
               print           ; print the result
               exit

make_f_chunk:  push f_chunk    ; chunk addr
               push 1          ; arg 1
               push 2          ; arg 2
               mktup 2         ; combine the args
               mktup 2         ; make the chunk
               exit


Chunks are used by the network primitives, OnRemote and OnNeighbor, which
are described below.
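The chunk representation and its unmarshalling preamble can be mimicked in Python as follows. The dictionary CODE stands in for the packet's code segment and Python lists stand in for the SNAP stack; both are modelling assumptions, and the result is returned rather than printed to keep the sketch easy to check.

```python
def make_chunk(entry, *args):
    """A chunk is a 2-tuple: (entry point, tuple of initial arguments)."""
    return (entry, tuple(args))


def f(stack):
    """Body of f: expects its two arguments on top of the stack."""
    j = stack.pop()
    i = stack.pop()
    return i + j


def f_chunk(stack):
    """Unmarshalling preamble: unpack the args tuple, then fall into f."""
    args = stack.pop()          # the tuple of arguments is on top of the stack
    stack.extend(args)          # place each argument where f expects it
    return f(stack)


CODE = {"f_chunk": f_chunk, "f": f}   # stand-in for the shared code segment


def run_chunk(chunk, code):
    entry, args = chunk
    return code[entry]([args])        # initial stack holds only the args tuple


chunk = make_chunk("f_chunk", 1, 2)
print(chunk, run_chunk(chunk, CODE))
```

Note that the chunk carries no code of its own, only an entry point into code that the packet already contains.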



Network Primitives. OnRemote and OnNeighbor map to SNAP’s send and 
hop primitives. Like their PLAN equivalents, the SNAP primitives require the 
user to specify a destination address, some resource bound to donate, and some 
code to execute. In SNAP, the code part is specified in two parts: an address 
in the code segment, and an initial stack, copied from the top of the current 
stack. Unlike OnRemote, send does not require a routing function argument; this 
is because SNAP programs execute on every hop and perform their own routing. 
Mapping routing functions to SNAP is described below. 

Compiling the network primitives is fairly straightforward. In the case that 
the chunk argument is a literal, the compiler uses the literal’s actual arguments 
and code address as arguments to send or hop. If a non-literal is used, then 
the compiler extracts the chunk preamble address as the code pointer (the first 
element of the pair) and the argument tuple as the stack (the second element) . 



Routing Functions. As discussed, unlike SNAP, PLAN programs do not eval- 
uate on every active router they traverse, but on the evaluation destination only. 
On the intervening PLAN nodes, a packet-specified “routing function” is evalu- 
ated instead, which determines the next hop and forwards the packet there. 

We implement routing functions in SNAP code as a preamble to the main 
program. This preamble checks if the packet has reached its destination, and 
if not looks up the next hop using the appropriate SNAP primitive or service 
function, and forwards the packet. If the packet has arrived at its destination, 
the preamble jumps to the actual entry point, stored on the top of the stack. For
the defaultRoute routing function, which is the one most often used in existing
PLAN programs, this preamble is two instructions: the forw instruction, followed
by a paj (“pop and jump”). Note that the routing function code is always earliest
in the packet, to ensure that the paj will result in a forward branch.
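The behaviour of such a preamble can be sketched in Python. The packet layout, the next_hop callback, and the returned action tuples are all assumptions made for this illustration; they are not SNAP's actual data structures.

```python
def preamble(packet, here, next_hop):
    """Routing preamble executed on every node the packet visits.

    Returns ("forward", hop) while the packet is in transit, and
    ("jump", entry_point) once it has reached its destination.
    """
    if packet["dest"] != here:
        return ("forward", next_hop(here, packet["dest"]))   # like forw
    return ("jump", packet["stack"].pop())                   # like paj


# Hypothetical static routes for a three-node line A - B - C.
ROUTES = {("A", "C"): "B", ("B", "C"): "C"}


def lookup(here, dest):
    return ROUTES[(here, dest)]


pkt = {"dest": "C", "stack": [7]}     # 7 is the real entry point
print(preamble(pkt, "A", lookup))
print(preamble(pkt, "C", lookup))
```

The entry point stays on the stack untouched until the destination is reached, so intermediate nodes do no work beyond the routing decision.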
4.4 Discussion

We now consider some of the ramifications of our compilation strategy. Because
PLAN programs of length n can consume O(x^n) time and space (for some x),
while SNAP programs of length n are bounded in time and space by O(n),
we can infer that compiling PLAN to SNAP must result in forbidding certain
PLAN constructs or in altering the length of the program. In fact, our compiler
does both. To eliminate exponential execution times, we expand problematic
operations, namely function calls and loops, so that they require more space in
the program. For example, the following PLAN program runs in time O(2^n):

fun f() = (print(1); print(1))
fun g() = (f(); f())
fun h() = (g(); g())

That is, the program has n = 3 lines, defines 2 * n = 6 function calls (2 per
function), but invokes these functions 2^n = 8 times dynamically. To compile this
program to SNAP, we inline each call:

fun h() = (print(1); print(1);   (* f() *)
           print(1); print(1);   (* f() *)  (* g() *)
           print(1); print(1);   (* f() *)
           print(1); print(1))   (* f() *)  (* g() *)

The result is that we have expanded the program so that it defines the same 
number of function calls that it invokes dynamically. A similar expansion takes 
place when unrolling fold. 
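The growth from 6 static calls to 8 inlined ones can be reproduced with a toy inliner in Python. The program representation (a dictionary from function name to a list of called names) is an assumption for illustration, and the sketch relies on the call graph being non-recursive.

```python
# Each function body is the list of calls it makes; names that are not
# keys of the dictionary (here "print") are treated as primitives.
PROGRAM = {
    "f": ["print", "print"],
    "g": ["f", "f"],
    "h": ["g", "g"],
}


def inline(program, name):
    """Return the fully inlined body of `name` as a flat list of primitives."""
    body = []
    for op in program[name]:
        if op in program:
            body.extend(inline(program, op))   # replace the call by its body
        else:
            body.append(op)
    return body


static_calls = sum(len(body) for body in PROGRAM.values())
inlined = inline(PROGRAM, "h")
print(static_calls, len(inlined))
```

Adding one more level that calls h twice would double the inlined size again, while the source program would grow by only one line.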

PLAN programs can also allocate memory on the order of O(x^n) for some
x. One way to do this is to structure the program as above, but have the leaf
nodes (i.e., the calls in f) perform constant-time allocation (say by creating a
new tuple). As above, we deal with these programs by expansion. However,
PLAN supports some operators that permit non-constant space allocation. For
example, the following program allocates O(2^n) memory blocks:

fun h() =
  let val x = [1;1]
      val x2 = x @ x
      val x3 = x2 @ x2 in
    x3
  end

That is, we double the length of the list for each line in the program by using
the append operator @. We could similarly concatenate the same string to itself
using the concatenation operator ^. Translations of these operations are not
straightforward, because they depend on the size of their arguments, so we forbid
their translation.

                     Size (B)         Ratio
Program           PLAN    SNAP    (SNAP/PLAN)
deliver            406     284       0.70
devinfo            586    1624       2.77
getNeighbors       123      80       0.65
getRoutes          117      80       0.68
multiprint         810    1856       2.29
ping               300     204       0.68
ping_pong          268     184       0.69
pingtime           556     384       0.69
query_gc           100      72       0.72
traceroute         519     336       0.65
Median             353     244       0.69

Fig. 4. Code Size Experiments. We present the wire-format size of the given
PLAN program, that of the SNAP program output by our compiler, and the
ratio of SNAP size to PLAN size.

The other problematic primitive in PLAN is the polymorphic equality operator =.
Checking for equality between two PLAN values is structural, meaning
that the contents of compound data structures, like lists and tuples, are
recursively compared, which necessarily requires non-constant time. SNAP supports
physical equality, meaning that two arguments are considered equal only if they
share the same identity (i.e., occupy the same region in memory). Clearly,
physical equality implies structural equality, but not the other way around. Therefore,
during the translation, we map PLAN's equality operator to the SNAP equality
operator, but signal a warning if the arguments are polymorphic or are known
not to be of primitive type.
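The difference between the two notions of equality can be seen in any language that offers both. The Python sketch below uses == for structural equality and `is` for physical equality; it is an analogy, not a statement about how PLAN or SNAP implement their operators.

```python
def structural_eq(x, y):
    """Recursively compares contents; cost grows with the size of the data."""
    return x == y


def physical_eq(x, y):
    """Compares identity only; constant time regardless of size."""
    return x is y


a = [1, [2, 3]]
b = [1, [2, 3]]     # same contents, separately allocated
c = a               # same object as a

print(structural_eq(a, b), physical_eq(a, b))
print(structural_eq(a, c), physical_eq(a, c))
```

The pair (a, b) is the case the compiler warning is about: a structural comparison would report the two values equal, a physical one would not.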

5 Experimental Analysis 

SNAP has already been demonstrated to run faster than PLAN [12]. Furthermore,
network transit overheads tend to dominate overall application performance,
rather than per-node processing overheads. For example, the compiler's
output for the PLAN ping code presented earlier is only marginally slower
than the tightly hand-tuned 7-instruction SNAP ping. As such,
the most important characteristic of our compiler with respect to performance 
is code size. As code size becomes bigger, an application must pay the over- 
heads of transmitting the code around in its packets. To make matters worse, 
larger code segments leave less room for useful payload, decreasing application 
throughput. 

In this section, we evaluate how compilation affects the resulting wire size of 
several PLAN programs. In most cases, the SNAP programs generated by the
compiler are about 30% smaller than their original PLAN versions. For those PLAN
programs that are not linear in their resource usage, we see an increase in code
size.
5.1 Code Size Experiments

We ran our compiler on ten programs selected from the PLANet [5] distribution;
they represent a variety of different networking tasks, from simple payload
delivery (deliver) to information gathering (devinfo, getNeighbors, getRoutes)
to multicast (multiprint) to simple network diagnostics (ping, traceroute).

Figure 4 contrasts the resulting packet sizes for the native PLAN wire format
for each program with the packet size for the corresponding SNAP program
produced by our compiler. Generally, the resulting SNAP programs for
straight-line execution are about 30% smaller than their PLAN equivalents. For two of the
programs, devinfo and multiprint, the resulting SNAP programs are larger 
than the PLAN originals. A closer analysis of these pathological cases reveals a 
great deal about the importance of various optimizations in our compiler. 

One first order effect is code bloat from loop unrolling; both programs iterate 
over all devices present on a given node. By default, the compiler unrolls fold 
five times. To understand the effect of unrolling, we parameterized our compiler 
not to unroll all of the fold operators, relying instead on simulating backward 
branches with send, as discussed in the previous section. In this case, the re- 
sulting SNAP program sizes were 728 bytes and 576 bytes respectively, giving 
SNAP/PLAN ratios of 1.24 and 0.71. This restores the typical 30% improvement 
for the multicast example, but not for devinfo. 
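The ratios quoted here and in Fig. 4 follow directly from the byte counts. The short Python check below recomputes them, rounding to two decimals; the dictionary layout is only a convenience for this sketch.

```python
from statistics import median

# (PLAN bytes, SNAP bytes) as reported in Fig. 4.
SIZES = {
    "deliver": (406, 284), "devinfo": (586, 1624),
    "getNeighbors": (123, 80), "getRoutes": (117, 80),
    "multiprint": (810, 1856), "ping": (300, 204),
    "ping_pong": (268, 184), "pingtime": (556, 384),
    "query_gc": (100, 72), "traceroute": (519, 336),
}

ratios = {name: round(snap / plan, 2) for name, (plan, snap) in SIZES.items()}
print(ratios["devinfo"], ratios["multiprint"])

# Medians over the ten programs.
print(median(p for p, _ in SIZES.values()),
      median(s for _, s in SIZES.values()),
      round(median(s / p for p, s in SIZES.values()), 2))

# SNAP sizes when fold is not unrolled (backward branches instead).
print(round(728 / SIZES["devinfo"][0], 2),
      round(576 / SIZES["multiprint"][0], 2))
```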

If we furthermore apply the topological sorting of basic blocks discussed
earlier, we can trim the resulting size of the devinfo SNAP program to
544 bytes, resulting in a ratio of 0.92. Now at least, the SNAP program is smaller
than the original PLAN version, but not by much. A quick perusal of the SNAP
program reveals that most operations are (often redundant) stack management
operations. Hand-tuning to eliminate cases like a push immediately followed by
a pop results in a new program size of 476 bytes (a ratio of 0.81), much closer to
the usual.

In the end, however, the compiled code must respect SNAP’s linear resource 
restrictions. As they are, PLAN's multiprint and devinfo programs consume
non-linear resources, since they iterate over the list of devices. Naive compilation 
results in code blowup proportional to a (conservative) bound on that iteration. 
We have shown that using backward branches, at the cost of one resource bound 
per branch, essentially results in packet sizes proportional to the original PLAN 
packets. In other words, we can trade resource bound for compactness. 

Of course, resource bound cannot be used to freely consume resources. In 
SNAP, resource bound is limited to 256 units per packet, which must be shared 
among that packet’s progeny. The result is that the compiler must use resource 
bound for backward branches only sparingly, or run the risk of losing packets 
when they run out. 
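To get a feel for this budget, one can estimate how many backward branches a fold needs. The formula below assumes a branch is taken only after u unrolled iterations leave the list unfinished, and that each branch costs one unit; it ignores any resource bound the program spends for other purposes. The function names are hypothetical.

```python
RB_LIMIT = 256  # per-packet resource bound limit stated in the text


def backward_branches_needed(n, u):
    """Branches needed to fold over n elements with u iterations per pass."""
    if u < 1:
        raise ValueError("u must be at least 1")
    if n == 0:
        return 0
    return (n + u - 1) // u - 1     # ceil(n / u) passes, one branch between each


def fits_in_budget(n, u, rb=RB_LIMIT):
    """True if the fold can finish without exhausting the resource bound."""
    return backward_branches_needed(n, u) <= rb


print(backward_branches_needed(12, 5))
print(fits_in_budget(1285, 5), fits_in_budget(1286, 5))
```

Under these assumptions a larger u lowers the number of branches but enlarges the packet, which is the balance the compiler has to strike.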

6 Future Work and Conclusions 

This work is part of a larger project to build a “second-generation” active
internetwork, called FASTnet. While FASTnet currently uses PLAN as its packet
language, our goal is to use SNAP instead, relying on the compiler to seamlessly
convert PLAN programs written by users into SNAP programs used by the
network. FASTnet is implemented in TAL/Popcorn, and takes advantage of
dynamic code updating to allow the system to evolve dynamically over time.
Adding a SNAP byte-code machine into FASTnet should be straightforward. 

To seamlessly incorporate SNAP into FASTnet requires more work on the 
compiler. The most important requirement is to improve the compiler’s gener- 
ated code quality, particularly in terms of code compactness. As we mentioned 
in Section 5, there are a number of optimizations that could net significant
reductions in code size. We have already identified that topological sorting of basic
blocks and careful arrangement of subexpressions to reduce stack reordering op- 
erations could result in substantial savings. 

We also need to gain more experience in using PLAN programs with non- 
linear resource consumption, such as the devinfo program from the previous 
section. Furthermore, we need to understand better what strategy to apply when 
a loop is not unrolled enough times to complete its task. We ultimately want 
the compiler to use reasonable heuristics to strike a balance between consuming 
resource bound and unrolling loops. 

While PLAN is flexible and highly usable, SNAP is resource-safe and efficient. 
We have shown in this paper that compiling PLAN to SNAP allows us to gain 
the benefits of both languages, but requires us to overcome some significant 
challenges to map PLAN to SNAP’s limited execution model. Initial performance 
measurements show that for simple programs, the compiler produces compact 
SNAP code that is approximately 30% smaller than the original PLAN code; 
programs that use iteration result in larger code sizes, owing in general to SNAP’s 
resource usage model. 
Abstract. The inclusion of innovative services in commercial networks is a
burdensome task which frequently encounters resistance from Network 
Operators. Opening up the network is a prerequisite for the Active & 
Programmable Network paradigm to succeed. In this paper we present a novel 
network model which addresses three critical points to achieve that goal: 
network security and safety, service management and high performance. We 
show that excessive virtualization of network resources penalizes performance 
and we introduce programmable hardware at the core of our model. We also 
introduce a two-tier security checking architecture which frees network nodes 
from the most heavyweight tasks, improving performance. Our single point of 
service admission permits strict security control. Lastly, the separation between 
service introduction and service management increases network flexibility and 
permits the smooth integration of other network architectures in our framework. 
We also present the Octopus Open Gateway architecture, which shall support 
our network model. 



1 Introduction 

Although networking is a highly dynamic field, a broad consensus exists regarding 
the difficulties of transporting innovative concepts into real networks. The 
deployment of new services or the introduction of new protocols is a slow and 
burdensome task, in which the Network Operator can be seen as the bottleneck. The
reasons are manifold. On the one hand, its traditional sources of revenue, the
transport of data and the management of the network, are becoming a commodity.
Increasing competition is driving prices down while the requirements posed by clients
are always increasing. More functionality, more bandwidth and better service imply
costly investments to keep up with the latest technology, which quickly becomes
obsolete. On the other hand, the heterogeneity and complexity of modern networks
make their management and configuration increasingly complex. It is very hard to
foresee the implications of introducing a new service or protocol for the correct
behavior of the network as a whole before testing it in the field. As a consequence,
the Network Operator is discouraged (for economic as well as technical reasons) from
expanding the functionality of its network and especially from granting third parties
access to its management. Its main concerns about “opening up” its infrastructure are
thus the security, safety and performance of the network.

The Active and Programmable Network (A&PN) communities, on their side, 
defend the idea that to foster innovation two requisites are necessary: Allowing 
Service Providers direct access to, and (partial) control of, the nodes and making 
network nodes programmable. The Active Network community goes even further by 
introducing the packet as the main network control and configuration unit: Incoming 
packets shall trigger the activation, download or reconfiguration of services inside the 
nodes. Several proposals in these fields have shown the feasibility of the concepts as 
well as some of their advantages. Nevertheless, it is our claim that for the broad 
dissemination of the A&PN paradigm three problems remain unresolved: A 
satisfactory security model, a general service management model and high 
performance. 

Existing proposals either do not provide a general framework addressing the 
security concerns of Network Operators or do so by developing heavy security 
architectures that strongly penalize performance. Although there is some work in 
progress trying to surmount this conflict, we believe that no existing architecture has 
achieved it so far in a completely general way. 

The second unresolved problem is performance. On the one hand, sharing control 
and communication network resources among several parties, as A&PN defends, 
needs coordination in the form of middleware actors, resource managers and the like. 
All these additional elements have a negative effect on performance. On the other hand,
as already mentioned, burdensome procedures are needed to fulfill the security
requirements of these open networks. We nevertheless claim that it is the exclusion of
open hardware that will most negatively affect performance in the long run. By
emphasizing an abstract view of network infrastructure, present approaches prevent 
service developers from taking direct advantage of the node hardware. There are 
many applications that would profit from hardware support. The increasing speed of 
the networks and the tremendous development of programmable hardware show us 
the necessity and the feasibility of using hardware-software co-design of new services 
to successfully support network programmability. 

Lastly, open network programming interfaces (ONPIs) of some sort are at the core 
of most proposals. It is claimed that they provide a foundation for service 
programming and the introduction of new network architectures. Beyond this 
undeniable fact, the problem of evolving ONPIs remains. Since it is impossible to 
foresee all the ways in which networking might evolve, programming interfaces, if 
not very carefully designed, are in themselves a restriction to innovation. They 
constrain the ways in which service creation and management might develop. 

In this paper we introduce the Octopus Open Network Model, which we believe 
addresses the three points stated above. We have developed a node architecture that 
includes not only a programmable software environment for Service Providers, but 
also a programmable hardware platform. This should strongly improve performance. 
We address the problem of the evolution of network programming interfaces by 
clearly differentiating service introduction from service management. We moreover 
keep the standardization of the latter at a minimum. We also use the introduction of 
our Trusted Development Servers (TDSs) to structure our security architecture in two 
tiers, placing all heavyweight mechanisms at the TDS. This should also benefit per- 
formance at the nodes. 
The rest of the paper is structured as follows: In chapter 2 we summarize some of
the most relevant previous work. In chapter 3 our network model is presented, while 
in chapter 4 our node architecture is described. We conclude the paper in chapter 5 by 
summarizing our contributions and presenting some future topics of research. 



2 Previous Work 

Research in the areas of Active and Programmable Networks has already produced a 
broad set of proposals. We will not exhaustively examine them here, but we will 
concentrate on the most relevant ones for our work instead. For a good survey on this 
area we refer the reader to [3] and [9]. 

The inclusion of programmable hardware in active node architectures has been 
rare. The idea is nevertheless present, as has been shown in the work of Hadzic et al. 
at the University of Pennsylvania [7] and Decasper et al. at Washington University at 
St. Louis [4] and recently also in the work of Dr. Zitterbart's group at the University 
of Braunschweig [8]. The P4 architecture, developed under the Protocol Boosters 
project [6] at UPenn consists of a pipeline of FPGAs interconnected by a switching 
array and controlled by a special Controller Unit. The Controller decides to which 
FPGA an incoming packet should be sent and also permits on-the-fly 
reprogrammability by dynamically separating any FPGA from the pipeline. With their 
prototype, Hadzic and his colleagues showed the feasibility of the idea, although in a 
very restricted form, since their platform was not designed to support several services 
concurrently or to dynamically select which packets should be processed by a certain 
service. 

The FHiPPs platform developed at the University of Braunschweig presents a 
much more advanced structure. It also includes several FPGAs interconnected by a 
switching matrix, plus an external processor, a DSP and ATM interfaces. The 
hardware is accessible from any application via so-called Happlets, which extend the 
functionality of classical device drivers to manage the reconfigurability of the 
platform. In itself a very promising design, it is nevertheless not clear how different 
packet streams shall request processing by different services multiplexed onto the 
same platform or how a chosen packet stream should access several services in a row. 
Moreover, the design is in itself monolithic, without expansion possibilities. 

The ANN design from WashU is the most comprehensive of all, including all 
aspects of the node architecture. Their particular hardware is composed of a set of 
ANPEs (Active Network Processing Engines) interconnected by an ATM switch core. 
Every ANPE includes a CPU, an FPGA and some memory. These elements are
controlled by the node OS, which can reprogram any of them on-the-fly. Scalability is 
provided by means of attaching more ANPEs to the switch. The only limitation is in 
the surveillance of the shared resources in hardware. As we will discuss later, 
software control of hardware resources is not enough in the presence of malfunctioning or
greedy service designs. 

On the software side, we borrow heavily from the experience gained by the groups 
at UPenn and WashU, plus the Tempest framework at the University of Cambridge 
[10]. The Switchware project at UPenn [1], [2] attempts to balance the flexibility of 
programmable networks and the security requirements stated in previous chapters. 
The main elements of their architecture are active packets, dynamically loadable
programs called Switchlets and active nodes. Active packets are written in a safe
language called PLAN. In order to ensure safety, the actions that active packets can
perform are very restricted. When more complex tasks are needed, active packets can
call Switchlets, which are programmed in CAML. This language supports formal
methodologies to prove security properties of the code. These code segments are loaded
out-of-band into the node. At the lowest layer, the Secure Active Network
Environment (SANE) ensures the integrity of the entire environment. 

The DAN architecture at WashU [5] sees services as a set of functions that are 
called by incoming packets. A packet might call several functions, which are then 
daisy-chained to process the packet in a row. If a packet needs a function which is not 
present in the node at the moment, it is downloaded from a well-known code server. 
This introduces additional delay, but permits concentrating the most heavyweight
security checks in those servers, where new modules are first stored. The server 
authenticates itself when downloading a new program into a node. The module itself 
can also be digitally signed. We elaborate on the idea of code servers to develop our 
Trusted Development Servers (see chapter 3). 

The most characteristic item of the Tempest framework is the definition of several 
parallel control architectures over the same infrastructure. These control architectures
are furthermore customizable on a per-service basis with the help of mobile code. The
whole concept rests upon the abstraction of node resources, which permits sharing
them in a transparent way among coexisting control architectures, under the common
surveillance of a resource divider called Prospero. This view of virtual networks over
the physical infrastructure is at the core of our Logical Overlay Networks (LONs).
The Tempest is nevertheless restricted to ATM networks and we will contend that 
their degree of resource abstraction penalizes performance. 



3 The Octopus Open Network Model 

The A&PN paradigm implies that the Service Provider is going to become the 
principal actor in the networking world. He will provide the content to give added 
value and differentiation to any network. Furthermore, it is the Service Provider itself 
who is going to manage its services. The Network Operator will find itself reduced to 
a commodity provider, which in this case means providing connectivity, bandwidth 
and a set of general management services and surveillance of the network. Certain 
basic QoS guarantees also fall into this category. 

This new network model, then, foresees the Service Provider as the Operator of its 
own Logical Overlay Network (LON), formed by all the (node and network) 
resources used by its services. Many such Service Operators will then be multiplexed 
over the same physical infrastructure. This model implies a new set of relationships 
among networking actors¹ (see Fig. 1).



¹ Although regulators (governmental agencies, standardization bodies, etc.) certainly influence 
all actors, their role will not be further explored here. 

Fig. 1. The Relationship among Actors. 



The client will be dependent on the Network Operator for his connectivity and 
basic transport of data and on the Service Provider for the content. At the other 
extreme of the chain, the Service Developer depends on the capabilities of the 
network, which fall under the control of the Network Provider, to develop new 
services. But he also must adapt his design to the necessities and business model of 
the Service Provider. 

Nowadays the control and management of the network still lie in the hands of the 
Network Operator. In the end, it is the way in which, and the extent to which, the 
Network Operator opens its network that will set the speed of the transition to, and 
the ultimate success of, this new environment. We claim, as stated above, that innovation 
is being slowed by the rigidity of the networks. In order to accelerate this transition, a 
model is needed that guarantees the Network Operator ultimate control of its 
network while leaving Service Operators the freedom to innovate. 

The main elements of the Octopus Open Network (OON) Model can be seen in 
Fig. 2. 

First among those is the Octopus Open Gateway (OOG), which will be analysed in 
chapter 4. These nodes are shared among many Service Operators. The union of all 
resources used by any one Service Operator is called a Logical Overlay Network 
(LON)². There are two access points to a LON. The first one is the Trusted 
Development Server (TDS). It represents the interface to introduce new services into 
a LON. The interface to manage those services, once installed, is directly controlled 
by the Service Operator (dotted line on Fig. 2). 



² We avoid the common terms “virtual node” and “virtual network” because we find them 
flawed. As so often in the networking world, we consider here simply an abstraction, a 
logical view of a physical infrastructure. There is nothing virtual about it. 

Fig. 2. The Octopus Open Network Model. Main Elements. 

The OOG provides a software and hardware programmable platform. It is our goal 
to foster the quick development and installation of new services by improving 
portability and taking the Network Operator off the management path. We also foster 
performance by opening the node hardware to the Service Developer. We 
acknowledge, though, that network heterogeneity is here to stay. Hence, complete 
portability, especially for hardware modules, is impossible to achieve. We shall 
elaborate on this shortly. 

The development and insertion of a new service proceeds as follows: 

First, the Service Developer (possibly under contract to a Service Operator) 
designs a new service. We foresee the use primarily of platform-independent 
languages, for the software parts of the service (with Java as an example) as well as 
for the hardware parts (e.g. VHDL). Since the architecture of every node is different 
and does not fall under the control of the developer, only the most abstract description of 
the hardware modules of a service can be kept portable. In a second step, the 
developer must adapt his hardware modules to the concrete platforms on which they are 
going to run. 

To introduce the service into the LON, the code is then sent to the TDS. We foresee 
the deployment of at least one TDS per Equipment Provider and Network Operator. 
The first role of the TDS is to check the correctness of the code by formal methods and to 
apply the most heavyweight security checks to the new service and its provider. The 
second role of the TDS is the integration of the new service into the configuration of the 
nodes. This mainly consists of communicating to the node the resource requirements 
of the new service, in order to check whether they can be satisfied. These resources are 
mainly CPU time, memory space and bandwidth. Since a service can be formed by 
software and hardware modules, the task of resource monitoring at the node is 
performed by the NodeOS as well as by our Hardware Manager, described in chapter 
4. Since the Hardware Manager is directly implemented in the form of VHDL code in our 
FPGAs, its reconfiguration is in fact performed at the TDS, which integrates the 
resource requirements into the new configuration of our hardware platform. For that, 
the TDS must either have a local copy of the configuration of its associated nodes, or 
upload it as needed. It would be advisable to have a hierarchy of TDSs in the 
network, in order to distribute the load more efficiently and to prevent a single point 
of failure. 
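
As an illustration of the resource check described above, the following minimal Python sketch models a TDS that keeps a local copy of a node's configuration and admits a service only if its CPU, memory and bandwidth requirements still fit. It is not part of the Octopus prototype: the names `Requirements`, `NodeConfig` and `admit`, as well as the units, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Requirements:
    """Resources requested by a service (units are illustrative assumptions)."""
    cpu_percent: int
    memory_kb: int
    bandwidth_kbps: int


@dataclass
class NodeConfig:
    """Hypothetical local copy of one node's configuration, held at the TDS."""
    capacity: Requirements
    admitted: dict = field(default_factory=dict)  # service name -> Requirements

    def free(self) -> Requirements:
        """Capacity left after subtracting everything already admitted."""
        used = list(self.admitted.values())
        return Requirements(
            self.capacity.cpu_percent - sum(r.cpu_percent for r in used),
            self.capacity.memory_kb - sum(r.memory_kb for r in used),
            self.capacity.bandwidth_kbps - sum(r.bandwidth_kbps for r in used),
        )


def admit(node: NodeConfig, service: str, req: Requirements) -> bool:
    """Admit the service only if every requested resource still fits."""
    free = node.free()
    fits = (
        req.cpu_percent <= free.cpu_percent
        and req.memory_kb <= free.memory_kb
        and req.bandwidth_kbps <= free.bandwidth_kbps
    )
    if fits:
        node.admitted[service] = req
    return fits
```

A request that would overbook any single resource is refused as a whole, which mirrors the idea that the decision is taken once, at the TDS, before anything is downloaded into the node.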

The TDS performs the adaptation of new services to the nodes. A TDS is in charge 
of this for every Service Operator using a certain physical infrastructure. Thus, strong 
security measures inside the TDS are needed, essentially to prevent access to 
foreign code. We envision the deployment of secure OSs to accomplish that. 

The adapted service is then downloaded into the relevant nodes. If a LON 
consists of nodes from different Equipment Providers, this adaptation must be done in 
parallel at their respective TDSs. 

As we see, the role of the TDS is twofold. First, it implements the secure interface 
for the introduction of new services. In doing so, it takes the burden of most security 
checks off the node, thus improving performance. The node must simply 
authenticate the TDS and the code prior to accepting a download. Its second task is adapting 
the code to its environment and communicating the resource needs of the new service 
to the node. This removes the need for costly (in terms of money and performance) 
additional processing power inside the nodes. 
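
The lightweight check left to the node can be sketched as follows. This Python example is only an illustration under stated assumptions: it uses a keyed digest (HMAC with a key shared between node and TDS) as a simple stand-in for the digital signatures mentioned earlier, and the names `sign_module`, `accept_download` and `trusted_keys` are hypothetical.

```python
import hashlib
import hmac


def sign_module(tds_key: bytes, code: bytes) -> bytes:
    """TDS side: tag the adapted code with a keyed SHA-256 digest."""
    return hmac.new(tds_key, code, hashlib.sha256).digest()


def accept_download(trusted_keys: dict, tds_id: str, code: bytes, tag: bytes) -> bool:
    """Node side: authenticate the TDS and the code before installing anything."""
    key = trusted_keys.get(tds_id)
    if key is None:
        return False  # unknown TDS: reject without inspecting the code
    expected = hmac.new(key, code, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, tag)
```

Everything heavier than this (formal verification, checks on the provider) is assumed to have happened at the TDS before the tag was issued.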

Once a new service has been deployed, its management falls entirely into the hands 
of the Service Operator. The resources that the service is going to use have already 
been set at the TDS and are hard-coded inside the NodeOS and Hardware Manager. 
These two entities monitor the behavior of all services installed in a node, enforcing 
their correct behavior. Hence, there is no need to restrict the ways in which the 
Service Operator manages its service. No restrictive interface is needed. This 
characteristic of our model can be seen by some as a drawback, since it provides the 
finest possible granularity of service control. Some Service Operators might prefer to 
have a certain management support. Our model does not preclude specialized companies 
from providing those services. For those other Service Operators that prefer to 
keep absolute control over their services, no restrictions are imposed. 

We thus avoid the standardization of Network Programming Interfaces (NPIs), 
since we believe that no NPI can foresee all possible technological evolutions. Hence, 
any NPI represents a potential restriction to innovation. That is why we replace this 
concept with our Service Admission Interface (SAI), implemented in our TDS. 



4 The Octopus Open Gateway Architecture 

The node architecture that we have developed is presented in Fig. 3. 

It consists of three main blocks: A management CPU, a Basic Hardware Platform 
(BHP) and a Universal Hardware Platform (UHP). 

The BHP implements the basic communication functionality, i.e. it is a “plain” 
router. Those incoming packets that do not need any kind of special treatment will 
simply be forwarded in a traditional fashion by the BHP. Nevertheless, at different 
points of the processing path (after packet classification, route lookup, etc.) there 
exists the possibility to forward the packets to the UHP via an internal backplane. The 
UHP presents the programmable hardware platform that Service Developers can use 
for their designs. The separation between BHP and UHP guarantees backward 
compatibility, since we do not force any change in the format or function of packets. 
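
The packet path just described can be modelled with a short Python sketch. It is a software illustration only, not the hardware design: the stage names follow the diversion points named in the text, while `BasicHardwarePlatform`, `diversions` and the dictionary packet format are assumptions made for the example.

```python
from typing import Callable, Dict

Packet = dict  # illustrative packet representation

# Points of the processing path after which a packet may be handed to the UHP.
STAGES = ("classification", "route_lookup")


class BasicHardwarePlatform:
    """Toy model of the BHP: forwards plainly unless a diversion matches."""

    def __init__(self) -> None:
        # stage name -> predicate deciding whether the packet goes to the UHP
        self.diversions: Dict[str, Callable[[Packet], bool]] = {}

    def forward(self, packet: Packet) -> str:
        for stage in STAGES:
            divert = self.diversions.get(stage)
            if divert is not None and divert(packet):
                return f"UHP after {stage}"
        return "BHP"  # traditional forwarding, packet format untouched
```

Packets that match no diversion never leave the BHP, which is the property the backward-compatibility argument relies on.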

[Figure: the Universal Hardware Platform (UHP) and the Basic Hardware Platform (BHP), linked by a Backplane] 

Fig. 3. The OOG Hardware Architecture. 



The UHP is composed of several modules, so-called UHP Modules (UHPMs), 
interconnected by another backplane. Every UHPM initially contains an FPGA and 
some memory. No further interfaces are needed, since the communication with the 
CPU and the BHP is controlled by a special unit in the UHP. The scalability of our 
platform is guaranteed by the modularity of the design. More UHPMs can be added at 
any time. On the other hand, since we specifically separate BHP and UHP, the latter 
can be replaced by a more powerful one if needed. 

The CPU supervises the functioning of the node and provides the software 
environment where new services will be inserted. Lately, the possibility to integrate 
microprocessor cores directly in the FPGAs has arisen. This seems very attractive, for 
it allows the integration of both parts of a service (software and hardware) in a 
common platform, easing the interchange of information between them. We leave this 
option for further study. 

Fig. 4. The Service Structure. 

The structure of a service can be seen in Fig. 4. As already mentioned, a service 
can be composed of several software modules (SMs, a.k.a. applications) and several 
hardware modules, which directly run on the UHPMs. They communicate by means 
of specialized device drivers (DDs). In this area we are investigating the possibilities 
of the Happlets introduced in the FHiPPs project [8]. To guarantee the success of our 
Open Gateway approach, three conditions have to be met: 

• Isolation between services 

• Isolation between Service Operators 

• Protection of the node against both services and Service Operators 

The sharing of resources in a transparent way implies that the QoS level agreed 
upon between the TDS and the Service Operator has to be maintained for all services 
at all times. That is, neither the addition or removal of services nor their normal 
activity may degrade the quality or performance of other services. Furthermore, the 
node itself must be protected against service malfunctions or malicious 
implementations. The monitoring of the QoS and security levels is shared between the 
NodeOS and a special hardware module, called the Hardware Manager. In software, 
the concept of safe execution environments, sandboxes, etc. is widely known and its 
usefulness accepted. We believe that the ultimate responsibility for QoS and security 
monitoring in software can only be taken by the NodeOS. We intend to explore the 
possible use of the security architecture developed inside the SwitchWare project [1] 
in our model. 

The proposals that include programmable hardware put it under the control of the 
NodeOS or, more specifically, of the device driver. We do not think that this approach 
succeeds in guaranteeing isolation among services. Access to shared hardware 
resources, like common buses and memory blocks, cannot be conveniently controlled 
by the NodeOS, since these details are hidden from it. A greedy service could block the 
whole UHP by constantly sending data over the common bus or by steadily writing to 
memory. A malicious implementation could even try to selectively access or damage 
another service's resources or data. To solve this problem we have introduced the 
Hardware Manager. Its main task is the monitoring of those common 
resources and the enforcement of QoS agreements. It is basically an enhanced hardware 
scheduler. Its configuration is updated by the TDS every time a new service is 
integrated, to take into account the new resource distribution. It is also the task of the 
TDS to decide, according to the existing load in the UHP, whether the new requirements can 
be satisfied. 
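
To make the role of the Hardware Manager more concrete, the following Python sketch models it in software as a per-period budget on a shared bus. It is an assumption-laden illustration, not the VHDL design: the budgeting scheme, the class `HardwareManager` and its methods `configure`, `request_bus` and `new_period` are all hypothetical.

```python
class HardwareManager:
    """Software model of per-period budgets on a shared bus (illustrative only)."""

    def __init__(self, bus_words_per_period: int) -> None:
        self.capacity = bus_words_per_period
        self.budget = {}  # service -> words allowed per period
        self.used = {}    # service -> words consumed in the current period

    def configure(self, service: str, words: int) -> bool:
        """Set a budget on behalf of the TDS; refuse one that overbooks the bus."""
        others = sum(w for s, w in self.budget.items() if s != service)
        if others + words > self.capacity:
            return False
        self.budget[service] = words
        self.used.setdefault(service, 0)
        return True

    def request_bus(self, service: str, words: int) -> bool:
        """Grant a transfer only while the service stays within its budget."""
        if service not in self.budget:
            return False  # service was never admitted
        if self.used[service] + words > self.budget[service]:
            return False  # a greedy service is stopped here
        self.used[service] += words
        return True

    def new_period(self) -> None:
        """Start a new scheduling period with fresh budgets."""
        for service in self.used:
            self.used[service] = 0
```

In this model a service that keeps sending data exhausts only its own budget, so the transfers of other services are still granted in the same period.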

While most proposals do not specify how packets are directed to their respective 
service instances, we specifically rely on packet classification for that task. The most 
usual business model foresees the client requesting a certain treatment for his traffic 
from his Service Operator (similar to current SLAs). It is then the Service Operator 
who explicitly configures his nodes to direct the client's traffic to the appropriate 
service instance. We support this model by leaving it to the Service Operator to 
configure the packet classification engine accordingly. Although we regard the 
concept of packets themselves signaling which service they require as less realistic, 
we do not preclude it. As stated below, such an approach can also be implemented 
inside our model. 
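
A classification table of the kind a Service Operator might configure can be sketched as below. The example is a minimal Python illustration, assuming rules keyed on source network and destination port with first-match semantics; the names `Rule` and `Classifier` and the chosen match fields are not taken from the OOG design.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass
class Rule:
    src_net: str             # client network, e.g. "10.1.0.0/16"
    dst_port: Optional[int]  # None matches any destination port
    service_instance: str    # where matching traffic is directed


class Classifier:
    """Toy packet classification engine configured by the Service Operator."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def classify(self, src: str, dst_port: int) -> Optional[str]:
        """Return the service instance for a packet, or None for plain forwarding."""
        for rule in self.rules:  # first matching rule wins
            in_net = ip_address(src) in ip_network(rule.src_net)
            if in_net and rule.dst_port in (None, dst_port):
                return rule.service_instance
        return None
```

Here the packets carry no service information of their own; the mapping from client traffic to service instance lives entirely in the operator's rules.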

Our proposal can be taken for a relatively conservative programmable network 
architecture. In reality, though, we try to present a framework inside of which a 
variety of network architectures can be implemented. We only limit the way in which 
services are integrated into the LON, in order to control the resources that they are going to 
use and to apply certain security measures. But we leave absolute freedom to realize 
any kind of service, including any other network architecture. As an example, a 
capsule-based approach could be realized by introducing a service which allows 
active packets to trigger certain functions inside this service. Our framework certainly 
forbids the dynamic download of new services on the fly. Nevertheless, most active 
packet approaches concede that only limited functionality can be directly transported 
inside the packets or downloaded on demand and that bigger programs should be 
downloaded out-of-band. That fits nicely into our vision. 
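
The capsule-style example mentioned above can be pictured as a service that holds functions installed out-of-band and lets packets select among them. The following Python sketch is purely illustrative: `CapsuleService`, the `trigger` and `payload` fields, and the pass-through behavior for unknown triggers are assumptions made for this example.

```python
class CapsuleService:
    """Toy service in which packets trigger functions installed out-of-band."""

    def __init__(self) -> None:
        self.functions = {}  # function name -> callable

    def install(self, name, func) -> None:
        """Out-of-band installation; in the model this code would pass the TDS."""
        self.functions[name] = func

    def handle(self, packet: dict):
        """Run the function named by the packet, if it has been installed."""
        func = self.functions.get(packet.get("trigger"))
        if func is None:
            return packet["payload"]  # unknown trigger: payload passes unchanged
        return func(packet["payload"])
```

The packet only names a function; it cannot bring new code into the node, so the admission path through the TDS remains the single point of entry.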

A safe NodeOS, the Hardware Manager and the TDS are the mechanisms which 
guarantee security, safety and QoS compliance in our model. We summarize the 
OOG's interface structure in Fig. 5. 

[Figure: the Octopus Open Gateway and its interfaces] 

SMI: Service Management Interface 
SwMI: Software Management Interface 
HwMI: Hardware Management Interface 
ST-SAI: Service Operator to TDS SAI 
TO-SAI: TDS to OOG SAI 
SAI: Service Admission Interface 

Fig. 5. The OOG’s Interfaces. 

The access to the network is controlled by means of the Service Operator to TDS 
Service Admission Interface (ST-SAI). Once validated, the service is downloaded into 
the OOG through the TDS to OOG Service Admission Interface (TO-SAI). The 
Service Operator can control its service through the Service Management Interface 
(SMI). The communication between hardware and software is controlled by the 
Software and Hardware Management Interfaces (SwMI and HwMI, respectively). 
Both are supervised by the NodeOS. It falls under the responsibility of the Service 
Operator to implement its own SMI. 



5 Conclusions and Further Work 

It is commonly accepted that implementing innovative concepts in commercial 
networks is a difficult task. The Network Operator is mostly responsible for that 
situation. Technical and economic reasons discourage him from opening his network to 
third parties. As a result, the infrastructure as well as the management of network 
services is kept under the absolute control of the incumbent Operator. In order to 
overcome this situation so that the Active and Programmable Network paradigm can 
succeed, a network model is needed that: 

• Preserves the security and safety of the network 

• Does not degrade its performance 

• Facilitates service creation, deployment and management by third parties (Service 
Operators) 

In this paper we introduce the Octopus Open Network Model. It represents a 
framework inside of which a variety of network architectures can be realized. We 
address the most critical points to “open up” the network by: 

• Presenting a single point of entrance to service integration, where heavyweight 
security mechanisms can be applied, at the systems as well as at the programming 
level. By only leaving lightweight security checks to the node (service 
authentication) we also boost performance. 

• Introducing programmable hardware at the core of service design to enhance 
performance. We also present our own vision of a Universal Hardware Platform 
to support these ideas. 

• Replacing Network Programming Interfaces with our Service Admission Interface, 
in the process accepting that innovation is not foreseeable. We thus leave it to the 
Service Operator to define its own management interfaces. 

We have also presented the architecture of the Octopus Open Gateway, which shall 
support our network model. Although we thankfully acknowledge the influence of 
previous work on our design, we introduce several innovations like the Hardware 
Manager, to monitor resource usage in our UHP, and the design of the UHP itself. 

At the moment of writing, we are beginning the implementation of a prototype 
gateway. A first implementation of the UHP is almost ready. We are also working on 
the realization of the Hardware Manager and implementing a service that serves as 
proof of concept. We are following a hardware-software co-design approach by 
dividing the service into two main modules, one of them implemented in Java and the 
other one in VHDL. 